Supplementary Notebook S12: Exploring the ProteoForge Results from the Combined 48 Hr & 72 Hr Analysis


Requirements:

This notebook requires 02-ApplyingProteoForge_Comb.ipynb to be run first to generate the result files used here. Ensure the ./data/cleaned/ and ./data/results/ directories contain the necessary input files.
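A minimal pre-flight check can catch missing inputs before any cell fails mid-run. This is a hypothetical helper (not part of ProteoForge); the paths mirror the folder layout and filenames described in this notebook, so adjust them if your project differs.

```python
import os

# Expected inputs, assuming the paths defined later in Section 01.3
required_files = [
    "./data/results/hypoxia/test_data_Comb.feather",
    "./data/results/hypoxia/summary_data_Comb.feather",
    "./data/results/hypoxia/uniprot_data_Comb.feather",
    "./data/cleaned/hypoxia/metadata.csv",
]

def missing_inputs(paths):
    """Return the subset of paths that do not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]

# Example: report anything that still needs to be generated
# by 02-ApplyingProteoForge_Comb.ipynb
for path in missing_inputs(required_files):
    print(f"Missing input: {path}")
```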


Data Information:

  • Cell line: H358 (NSCLC)
  • Conditions: True Hypoxia (1% O₂) at 48 Hr and 72 Hr, Normoxia (21% O₂) at 48 Hr and 72 Hr
  • Input: test_data_Comb.feather, summary_data_Comb.feather, uniprot_data_Comb.feather
  • Output: Visualizations and summary statistics for identified proteoforms and some example proteins of interest

Purpose:

This notebook summarizes and visualizes the results from the combined ProteoForge analysis of hypoxia vs normoxia conditions at 48-hour and 72-hour timepoints. It includes protein-level differential proteoform (dPF) analysis, peptide composition distributions, UniProt annotation mapping, and case studies of biologically relevant proteoforms.


01. Setup

This section imports required libraries, configures display settings, and defines paths for data and figures.

Note: The HTML rendering of this notebook hides code cells by default. Click the "Code" buttons on the right to expand them.

01.1 Libraries

# Libraries Used
import os
import sys

import numpy as np # Numerical computing
import pandas as pd # Data manipulation

import seaborn as sns # R-like high-level plots
import matplotlib.pyplot as plt # Python's base plotting 

# home path
sys.path.append('../')
from src import utils, plots

from ProteoForge import annotate
from ProteoForge import proteoform_classifier as pfc

# Initialize the timer
startTime = utils.getTime()

01.2 Configure Notebook Settings

Configure visualization styles, color palettes, and display options for consistent output formatting throughout the analysis.

## Configure Plotting Settings
def_colors = [
    "#139593", "#fca311", "#e54f2a",
    "#c3c3c3", "#555555",
    "#690000", "#5f4a00", "#004549"
]

condition_colors = {
    'True Hypoxia (1% O2) - 48 Hr': '#e9c46a',
    'True Hypoxia (1% O2) - 72 Hr': '#f4a261',
    'Normoxia (21% O2) - 48 Hr': '#ade8f4',
    'Normoxia (21% O2) - 72 Hr': '#0077b6',
    # Won't be used since the manuscript didn't cover these conditions
    # 'Chemical Hypoxia (CoCl2) - 48 Hr': '#c77dff',
    # 'Chemical Hypoxia (CoCl2) - 72 Hr': '#8338ec',
    # 'Oxidative Stress (H2O2) - 48 Hr': '#ff006e',
    # 'Oxidative Stress (H2O2) - 72 Hr': '#fb5607',
}

# Set seaborn style
sns.set_theme(
    style="white",
    context="paper",
    palette=def_colors,
    font_scale=1,
    rc={
        "figure.figsize": (6, 4),
        "font.family": "sans-serif",
        "font.sans-serif": ["Arial", "Ubuntu Mono"],
    }
)

# Figure Saving Settings
figure_formats = ["png", "pdf"]
save_to_folder = True
transparent_bg = True
figure_dpi = 300

## Configure dataframe displaying
pd.set_option('display.float_format', lambda x: '%.4f' % x)
pd.set_option('display.max_rows', 50)
pd.set_option('display.max_columns', 50)
pd.set_option('display.width', 500)  # Set a wider display width


## Printing Settings
verbose = True

plots.color_palette( def_colors, save=False )
plots.color_palette( condition_colors, save=False, name='condition', size=2)

01.3 Data and Result Paths

  • home_path — Base project path (e.g., ./)
  • data_name — Dataset identifier used throughout ("hypoxia")
  • data_path — Root data folder ({home_path}data/)
  • input_path — Cleaned input data from preprocessing ({data_path}cleaned/{data_name}/)
  • output_path — ProteoForge result files ({data_path}results/{data_name}/)
  • figure_path — Destination for generated figures ({home_path}figures/{data_name}/03-ResultsAnalysis/)
  • notebook_name — This notebook name for logging and organizing outputs ("03-ResultsAnalysis")
home_path = './'
data_name = "hypoxia"
notebook_name = "03-ResultsAnalysis"
data_path = f"{home_path}data/"
input_path = f"{data_path}cleaned/{data_name}/"
output_path = f"{data_path}results/{data_name}/"
figure_path = f"{home_path}figures/{data_name}/{notebook_name}/"

# Create the output folder if it does not exist
if not os.path.exists(output_path):
    os.makedirs(output_path)

# Create figure folder structure, if needed
if save_to_folder:
    for i in figure_formats:
        cur_folder = figure_path + i + "/"
        if not os.path.exists(cur_folder):
            os.makedirs(cur_folder)

01.4 Load Input Data

Load the ProteoForge analysis outputs and supporting data including UniProt annotations with feature metadata, sample metadata with condition mappings, test data with peptide-level results, and summary data with protein-level statistics.

Uniprot Data: Contains expanded annotations for proteins in the data. (uniprot_data_Comb.feather)

uniprot_data = pd.read_feather(f"{output_path}uniprot_data_Comb.feather")
uniprot_data = annotate.add_feature_metadata(uniprot_data, annotate.feature_categories, annotate.relevant_features)
print(f"UniProt data loaded with shape: {uniprot_data.shape}")
uniprot_data.info()
uniprot_data
UniProt data loaded with shape: (774552, 12)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 774552 entries, 0 to 774551
Data columns (total 12 columns):
 #   Column            Non-Null Count   Dtype 
---  ------            --------------   ----- 
 0   isoform           774552 non-null  string
 1   Protein           774552 non-null  string
 2   feature           774552 non-null  string
 3   start             774552 non-null  Int64 
 4   end               774552 non-null  Int64 
 5   group             774552 non-null  string
 6   agent             774552 non-null  string
 7   note              774552 non-null  string
 8   description       774552 non-null  string
 9   score             774552 non-null  int64 
 10  is_relevant       774552 non-null  bool  
 11  feature_category  774552 non-null  object
dtypes: Int64(2), bool(1), int64(1), object(1), string(7)
memory usage: 67.2+ MB
isoform Protein feature start end group agent note description score is_relevant feature_category
0 A0A024RBG1 CHAIN 1 181 Diphosphoinositol polyphosphate phosphohydrola... 3 False Protein Processing & Maturation
1 A0A024RBG1 CROSSLNK 5 5 Ubiquitination/SUMOylation Ubiquitination/SUMOylation at K5 Source: psp; Type: UBIQUITINATION 1 True Co- & Post-Translational Modifications
2 A0A024RBG1 BINDING 10 10 3 False Functional Domains, Regions & Sites
3 A0A024RBG1 BINDING 18 20 3 False Functional Domains, Regions & Sites
4 A0A024RBG1 DOMAIN 18 145 Nudix hydrolase 3 False Functional Domains, Regions & Sites
... ... ... ... ... ... ... ... ... ... ... ... ...
774547 Q9Y6Y8 MOD_RES 930 930 Phosphorylation Phosphorylation at S930 Source: psp; Type: PHOSPHORYLATION 1 True Co- & Post-Translational Modifications
774548 Q9Y6Y8 CROSSLNK 931 931 Ubiquitination/SUMOylation Ubiquitination/SUMOylation at K931 Source: psp; Type: UBIQUITINATION 1 True Co- & Post-Translational Modifications
774549 Q9Y6Y8 MOD_RES 935 935 Phosphorylation Phosphorylation at Y935 Source: psp; Type: PHOSPHORYLATION 1 True Co- & Post-Translational Modifications
774550 Q9Y6Y8 CROSSLNK 938 938 Ubiquitination/SUMOylation Ubiquitination/SUMOylation at K938 Source: psp; Type: UBIQUITINATION 1 True Co- & Post-Translational Modifications
774551 Q9Y6Y8 CLEAVAGE 989 989 Proteolytic Cleavage trypsin 1 N-Ph cleavage at R989 by trypsin 1 MEROPS Cleavage Site; Protease: trypsin 1 (MER... 2 True Protein Processing & Maturation

774552 rows × 12 columns

Metadata: Sample metadata with condition mappings. (metadata.csv). Preview (first 5 rows):

meta = pd.read_csv(f"{input_path}metadata.csv")
# Build a dictionary to match condition_colors keys: 'Condition - Duration Hr'
condition_to_samples = {}
for condition in meta['Condition'].unique():
    for duration in sorted(meta.loc[meta['Condition'] == condition, 'Duration'].unique()):
        key = f"{condition} - {duration} Hr"
        samples = meta.loc[(meta['Condition'] == condition) & (meta['Duration'] == duration), 'Colname'].tolist()
        condition_to_samples[key] = samples

# samples_to_condition mapping
samples_to_condition = {sample: condition for condition, samples in condition_to_samples.items() for sample in samples}
# samples_to_colors mapping
samples_to_colors = {sample: condition_colors[condition] for sample, condition in samples_to_condition.items()}
meta.head(5)
filename Run Condition Replicate Duration Colname
0 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_1_48_... H358_1%_1_48_DIA_BB4_1_5432 True Hypoxia (1% O2) 1 48 1%_48_1
1 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_2_48_... H358_1%_2_48_DIA_BC4_1_5440 True Hypoxia (1% O2) 2 48 1%_48_2
2 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_3_48_... H358_1%_3_48_DIA_BD4_1_5448 True Hypoxia (1% O2) 3 48 1%_48_3
3 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_4_48_... H358_1%_4_48_DIA_BE4_1_5456 True Hypoxia (1% O2) 4 48 1%_48_4
4 D:\Data\Tam\RevisionLungDIA\H358\H358_1%_2_72_... H358_1%_2_72_DIA_BC8_1_5444 True Hypoxia (1% O2) 2 72 1%_72_2

Setting Up the Data

Here we load the result files (test_data_Comb.feather, summary_data_Comb.feather, and info_data.feather) generated by the combined ProteoForge analysis in the previous notebook, along with protein_data.feather, which provides protein-level information (expanded FASTA).

Peptide Regulation Across Conditions

One way to look at the data is peptide regulation relative to the control condition (Normoxia 21% O₂ at 72 Hr): each peptide is called Up or Down versus control in each of the three other conditions. This is not the most intuitive view of the data, but it gives a better idea of how individual peptides are regulated across conditions.

pThr = 10**-3

test_data = pd.read_feather(f"{output_path}test_data_Comb.feather")
print(test_data.shape)
print()


summary_data = pd.read_feather(f"{output_path}summary_data_Comb.feather")
print(summary_data.shape)
print()

info_data = pd.read_feather(f"{input_path}info_data.feather")
print(info_data.shape)
print()

summary_data['trace'] = info_data['trace']

protein_data = pd.read_feather(f"{input_path}protein_data.feather")
print(protein_data.shape)
print()


## Adding Hypoxia Regulation compared to Normoxia 

df =  test_data.copy()

# --- 2. Define Control and Treatment Conditions ---
control_condition = 'Normoxia (21% O2) - 72 Hr'
# Sorting ensures the regulation string (e.g., 'Up_Down') is always in a consistent order.
treatment_conditions = sorted([cond for cond in df['Condition'].unique() if cond != control_condition])

print(f"Control: {control_condition}")
print(f"Treatments (in order): {treatment_conditions}")

# --- 3. Calculate Mean Intensity for Each Peptide in Each Condition ---
# This aggregates the technical replicates for each condition.
# The mean of AdjIntensity for a treatment is its log2FC vs. control.
# Using .agg is slightly more explicit and can be faster than .mean()
mean_intensities = df.groupby(['Protein', 'PeptideID', 'Condition'], as_index=False).agg(log2FC=('AdjIntensity', 'mean'))


# --- 4. Create the Regulation Table ---
# We can filter out the control condition, as its regulation is baseline (0).
regulation_df = mean_intensities[mean_intensities['Condition'] != control_condition].copy()

# Calculate the linear Fold Change from the log2FC.
# FoldChange = 2^(log2FC)
regulation_df['FoldChange'] = 2**regulation_df['log2FC']


# --- 5. Categorize Regulation Tendencies (Optimized) ---
# To categorize tendencies, we first pivot the table to have conditions as columns.
# This allows us to see the behavior of one peptide across all conditions in a single row.
pivot_reg = regulation_df.pivot_table(
    index=['Protein', 'PeptideID'],
    columns='Condition',
    values='log2FC'
).fillna(0) # Fill peptides not found in a condition with 0 log2FC (no change)

# This vectorized approach is much faster than using .apply()
# It builds the regulation string (e.g., 'Up_Down') column by column.
regulation_strings = None
for i, cond in enumerate(treatment_conditions):
    # Use np.select for fast conditional categorization based on direction
    categories = np.select(
        [pivot_reg[cond] > 0, pivot_reg[cond] < 0],
        ['Up', 'Down'],
        default='NC' # No Change
    )
    
    # Convert to a pandas Series to align with the pivot_reg index
    categories_series = pd.Series(categories, index=pivot_reg.index)

    if i == 0:
        regulation_strings = categories_series
    else:
        # Append the category for the next condition
        regulation_strings += '_' + categories_series

pivot_reg['Regulation'] = regulation_strings


# Now, merge the 'Regulation' column back into our long-format table
# This adds the summary string to each row corresponding to a given peptide.
regulation_df = regulation_df.merge(
    pivot_reg[['Regulation']],
    on=['Protein', 'PeptideID'],
    how='left'
)

# Reorder columns for a clean final output
regulation_df = regulation_df[['Protein', 'PeptideID', 'Condition', 'log2FC', 'FoldChange', 'Regulation']]


print("\n--- Final Regulation Table with Tendencies ---")
print(regulation_df.head())

# You can now save this final dataframe to a new file, for example:
# regulation_df.to_csv('regulation_analysis_output.csv', index=False)

# Example of how to use the new 'Regulation' column:
print("\n--- Example: Grouping by Regulation Tendency ---")
print(regulation_df.groupby('Regulation')['PeptideID'].nunique())

plot_data = regulation_df[['Protein', 'PeptideID', 'Regulation']].drop_duplicates().copy()

# Adding the plot_data to summary_data to get the Regulation column (use hypoxia 48hr-72hr)
summary_data = summary_data.merge(
    plot_data[['Protein', 'PeptideID', 'Regulation']],
    on=['Protein', 'PeptideID'],
    how='left'
)
summary_data['Regulation'] = summary_data['Regulation'].fillna('NC_NC')  # peptides not in regulation_df are 'No Change' in both conditions
summary_data.rename(columns={'Regulation': 'Regulation_(Hypoxia 48hr-72hr)'}, inplace=True)

fig, ax = plt.subplots(figsize=(8, 6))
plot_data['Regulation'].value_counts().plot(kind='barh', color="#432c46", edgecolor='black', ax=ax)
# Add counts next to the bars for barh
for i, v in enumerate(plot_data['Regulation'].value_counts()):
    ax.text(
        v, i, str(v),
        color='black', 
        ha='left', va='center', 
        rotation=0, fontsize=15, fontweight='bold'
    )
ax.set_xlabel("Number of Peptides", fontsize=12)
ax.set_ylabel("Regulation Tendency (Hypoxia 48Hr-72Hr)", fontsize=12)
ax.set_title("Peptide Regulation Tendencies Across Conditions", fontsize=14, fontweight='bold')
ax.grid(axis="both", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
plt.tight_layout()

plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename="peptide_regulation_tendencies",
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

print()
summary_data['Regulation_(Hypoxia 48hr-72hr)'].value_counts()
(1335166, 24)

(95369, 17)

(95369, 15)

(7161, 8)

Control: Normoxia (21% O2) - 72 Hr
Treatments (in order): ['Normoxia (21% O2) - 48 Hr', 'True Hypoxia (1% O2) - 48 Hr', 'True Hypoxia (1% O2) - 72 Hr']

--- Final Regulation Table with Tendencies ---
      Protein  PeptideID                     Condition  log2FC  FoldChange      Regulation
0  A0A024RBG1          1     Normoxia (21% O2) - 48 Hr -0.0347      0.9762  Down_Down_Down
1  A0A024RBG1          1  True Hypoxia (1% O2) - 48 Hr -0.0674      0.9544  Down_Down_Down
2  A0A024RBG1          1  True Hypoxia (1% O2) - 72 Hr -0.1384      0.9085  Down_Down_Down
3  A0A024RBG1          2     Normoxia (21% O2) - 48 Hr  0.0226      1.0158    Up_Down_Down
4  A0A024RBG1          2  True Hypoxia (1% O2) - 48 Hr -0.0310      0.9787    Up_Down_Down

--- Example: Grouping by Regulation Tendency ---
Regulation
Down_Down_Down    188
Down_Down_Up      188
Down_Up_Down      126
Down_Up_Up        231
Up_Down_Down       89
Up_Down_Up        119
Up_Up_Down        170
Up_Up_Up          353
Name: PeptideID, dtype: int64




Regulation_(Hypoxia 48hr-72hr)
Down_Down_Down    37985
Down_Down_Up      17746
Up_Up_Up          13447
Up_Up_Down         8265
Down_Up_Up         6597
Up_Down_Down       4310
Up_Down_Up         3716
Down_Up_Down       3303
Name: count, dtype: int64

The majority of peptides trend lower than the control condition in all three comparisons (Down_Down_Down). Together with Down_Down_Up, all-Up (Up_Up_Up), and Up_Up_Down, these account for the four most frequent patterns.

Note: This is an early indicator that the conditions have a strong effect on peptide regulation, which will be further explored at the proteoform level in the next section. It also shows that as the number of conditions increases, the complexity of the data grows exponentially, making it hard to interpret, even manually or visually.
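The combinatorial growth noted above can be made concrete: with three possible calls per treatment (Up, Down, NC) and k treatment conditions, there are 3**k possible regulation strings. A small sketch:

```python
# Number of possible regulation strings for k treatment conditions,
# each of which can be Up, Down, or NC (3 states).
def n_regulation_patterns(k: int, states: int = 3) -> int:
    return states ** k

# Three treatments (as in this notebook) allow up to 27 patterns.
# Only the 8 strict Up/Down combinations appear in the output above,
# since exact-zero log2FC means (the 'NC' case) are rare on float data.
print(n_regulation_patterns(3))            # 27 with Up/Down/NC
print(n_regulation_patterns(3, states=2))  # 8 with Up/Down only
```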


02. Protein-Level Analysis

This section generates protein-level summaries and visualizations of dPF discovery, including coverage relationships, peptide composition distributions, and proteoform category breakdowns.

# Create protein-level summary with dPF categories (vectorized, fast pandas)
# Uses indicator columns and a single groupby.agg to avoid per-group Python callbacks.

# local alias
df = summary_data.copy()

# Ensure numeric/bool-like columns use compact dtypes
if df['isSignificant'].dtype != 'int32' and df['isSignificant'].dtype != 'int64':
    df['isSignificant'] = df['isSignificant'].fillna(False).astype('int32')

# Temporary indicator columns (fast to sum)
df['is_canonical'] = (df['dPF'] == 0).astype('int32')
df['is_ptm'] = (df['dPF'] == -1).astype('int32')
df['is_proteoform'] = (df['dPF'] > 0).astype('int32')

# Perform one grouped aggregation (all C-level operations)
protein_dpf_analysis = (
    df.groupby('Protein', sort=False, observed=True)
      .agg(
          Gene = ('Gene', 'first'),
          Length = ('Length', 'first'),
          Coverage = ('Coverage', 'first'),
          all_pep = ('isSignificant', 'count'),
          sgnf_pep = ('isSignificant', 'sum'),
          all_clust = ('ClusterID', 'max'),
          n_forms = ('dPF', 'nunique'),
          n_canon = ('is_canonical', 'sum'),
          n_ptm = ('is_ptm', 'sum'),
          n_dPF = ('is_proteoform', 'sum'),
          max_dpf = ('dPF', 'max'),
      )
)

# Overwrite n_forms with the max positive dPF (proteoform count proxy):
# only positive dPF values indicate proteoform IDs, so non-positive maxima become 0.
protein_dpf_analysis['n_forms'] = (
    protein_dpf_analysis['max_dpf']
      .fillna(0)
      .where(protein_dpf_analysis['max_dpf'] > 0, 0)
      .astype('int32')
)
protein_dpf_analysis.drop(columns=['max_dpf'], inplace=True)

# If other code expects a flattened/explicit column order, optionally reorder here.
# Cleanup temporary indicator columns to free memory
try:
    df.drop(columns=['is_canonical', 'is_ptm', 'is_proteoform'], inplace=True)
except Exception:
    pass

# Add population rates for peptides, canonical, PTM, and proteoforms
protein_dpf_analysis['sgnf_pct'] = (
    protein_dpf_analysis['sgnf_pep'] / protein_dpf_analysis['all_pep']
).fillna(0).astype('float32') * 100.0 
protein_dpf_analysis['canon_pct'] = (
    protein_dpf_analysis['n_canon'] / protein_dpf_analysis['all_pep']
).fillna(0).astype('float32') * 100.0 
protein_dpf_analysis['ptm_pct'] = (
    protein_dpf_analysis['n_ptm'] / protein_dpf_analysis['all_pep']
).fillna(0).astype('float32') * 100.0 
protein_dpf_analysis['dPF_pct'] = (
    protein_dpf_analysis['n_dPF'] / protein_dpf_analysis['all_pep']
).fillna(0).astype('float32') * 100.0 
protein_dpf_analysis['eff_pct'] = (
    protein_dpf_analysis['n_forms'] / protein_dpf_analysis['all_clust']
).fillna(0).astype('float32') * 100.0

# Categorize coverage
protein_dpf_analysis['cov_cat'] = pd.cut(
    protein_dpf_analysis['Coverage'],
    bins=[-1, 25, 50, 75, 100],
    labels=['Low', 'Medium', 'High', 'V.High'],
    include_lowest=True
)
# Categorize available significant peptides per protein
protein_dpf_analysis['sgnf_cat'] = pd.cut(
    protein_dpf_analysis['sgnf_pep'],
    bins=[-1, 0, 1, np.inf],
    labels=['None', 'Single', 'Multiple'],
    right=True
)
# Categorize protein length
protein_dpf_analysis['len_cat'] = pd.cut(
    protein_dpf_analysis['Length'],
    bins=[-1, 150, 250, 750, 1000, 2000, np.inf],
    labels=['V.Short', 'Short', 'Medium', 'Long', 'V.Long', 'Huge'],
    right=True
)


# Categorize dPFs and PTMs containment per protein
labels = ['No Forms', 'PTM Only', 'dPF Only', 'Both']

conds = [
    (protein_dpf_analysis['n_forms'] == 0) & (protein_dpf_analysis['n_ptm'] == 0),
    (protein_dpf_analysis['n_forms'] == 0) & (protein_dpf_analysis['n_ptm'] > 0),
    (protein_dpf_analysis['n_forms'] > 0) & (protein_dpf_analysis['n_ptm'] == 0),
    (protein_dpf_analysis['n_forms'] > 0) & (protein_dpf_analysis['n_ptm'] > 0),
]

protein_dpf_analysis['dpf_cat'] = pd.Categorical(
    np.select(conds, labels, default='No Forms'),
    categories=labels
)

# protein_dpf_analysis is now ready with n_canon, n_ptm, n_dPF, n_forms, the percentage columns, etc.
protein_dpf_analysis = protein_dpf_analysis.reset_index()
protein_dpf_analysis
Protein Gene Length Coverage all_pep sgnf_pep all_clust n_forms n_canon n_ptm n_dPF sgnf_pct canon_pct ptm_pct dPF_pct eff_pct cov_cat sgnf_cat len_cat dpf_cat
0 A0A024RBG1 NUDT4B 181 30.3867 5 0 2 0 5 0 0 0.0000 100.0000 0.0000 0.0000 0.0000 Medium None Short No Forms
1 A0A0B4J2D5 GATD3B 268 35.4478 7 0 3 0 7 0 0 0.0000 100.0000 0.0000 0.0000 0.0000 Medium None Medium No Forms
2 A0A1B0GTU1 ZC3H11B 805 8.1988 5 1 2 0 4 1 0 20.0000 80.0000 20.0000 0.0000 0.0000 Low Single Long PTM Only
3 A0A1W2PQL4 ZNF722 384 11.7188 5 1 2 0 4 1 0 20.0000 80.0000 20.0000 0.0000 0.0000 Low Single Medium PTM Only
4 A0A2R8Y4L2 HNRNPA1L3 275 44.7273 12 1 2 0 11 1 0 8.3333 91.6667 8.3333 0.0000 0.0000 Medium Single Medium PTM Only
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
7156 Q9Y6X5 ENPP4 453 19.8675 7 1 2 0 6 1 0 14.2857 85.7143 14.2857 0.0000 0.0000 Low Single Medium PTM Only
7157 Q9Y6X8 ZHX2 837 33.0944 22 2 2 1 2 0 20 9.0909 9.0909 0.0000 90.9091 50.0000 Medium Multiple Long dPF Only
7158 Q9Y6X9 MORC2 1032 27.2287 23 2 4 1 20 1 2 8.6957 86.9565 4.3478 8.6957 25.0000 Medium Multiple V.Long Both
7159 Q9Y6Y0 IVNS1ABP 642 11.0592 5 0 3 0 5 0 0 0.0000 100.0000 0.0000 0.0000 0.0000 Low None Medium No Forms
7160 Q9Y6Y8 SEC23IP 1000 22.7000 17 2 2 1 15 0 2 11.7647 88.2353 0.0000 11.7647 50.0000 Low Multiple Long dPF Only

7161 rows × 20 columns

Protein-level dPF analysis table constructed with counts of canonical peptides, PTMs, and proteoforms per protein. Categories cover coverage percentage, significance rate, and proteoform efficiency metrics. Essentially, this gives an overall summary of how many peptides with PTMs or apparent dPFs exist per protein, along with their prevalence as percentages.

Column Definitions:

  • all_pep: Total number of peptides observed for the protein (count of peptide rows).
  • sgnf_pep: Number of peptides called significant (sum of isSignificant).
  • all_clust: Max ClusterID for the protein (proxy for number of peptide clusters; use unique count of ClusterID if IDs are non-contiguous).
  • n_forms: Number of proteoforms for the protein (positive dPF values).
  • n_canon: Count of canonical peptides (dPF == 0).
  • n_ptm: Count of PTM-only peptides (dPF == -1).
  • n_dPF: Count of peptides assigned to proteoforms (dPF > 0).
  • sgnf_pct: Percentage of peptides that are significant: (sgnf_pep / all_pep) * 100.
  • canon_pct: Percentage of canonical peptides: (n_canon / all_pep) * 100.
  • ptm_pct: Percentage of PTM-only peptides: (n_ptm / all_pep) * 100.
  • dPF_pct: Percentage of peptides assigned to proteoforms: (n_dPF / all_pep) * 100.
  • eff_pct: Proteoform discovery efficiency: (n_forms / all_clust) * 100 — fraction of clusters yielding proteoforms.
  • cov_cat: Coverage category based on Coverage (Low, Medium, High, V.High).
  • sgnf_cat: Significant-peptide category based on sgnf_pep (None, Single, Multiple).
  • len_cat: Length category derived from Length (V.Short, Short, Medium, Long, V.Long, Huge).
  • dpf_cat: Proteoform/PTM containment category: No Forms, PTM Only, dPF Only, or Both.

Notes:

  • dPF conventions: dPF == 0 → canonical peptide; dPF == -1 → PTM-only; dPF > 0 → proteoform ID.
  • Percent columns are float percentages in range 0–100.
  • all_clust uses max(ClusterID) as the cluster count proxy; if your clustering IDs are not contiguous starting at 1, use an explicit unique count of ClusterID if you want exact cluster counts.
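The dPF conventions and the cluster-count caveat above can both be sketched directly; `classify_dpf` is a hypothetical helper for illustration, not part of ProteoForge.

```python
import pandas as pd

def classify_dpf(dpf: int) -> str:
    """Map a dPF value to its peptide category per the conventions above."""
    if dpf == 0:
        return "Canonical"
    if dpf == -1:
        return "PTM"
    return f"Proteoform {dpf}"  # positive dPF values are proteoform IDs

# Exact cluster counts: nunique() is robust to non-contiguous ClusterIDs,
# whereas max() (the all_clust proxy used above) over-counts when IDs have gaps.
clusters = pd.Series([1, 1, 3, 7])
exact = clusters.nunique()  # 3 distinct clusters
proxy = clusters.max()      # 7 under the max-based proxy
```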

02.1 Significance by Coverage and Proteoforms

Heatmap showing the relationship between protein coverage categories (Low, Medium, High, V.High) and number of proteoforms discovered per protein (max=4). Average significance percentage indicates how coverage and proteoform count relate to detection of significant peptides within proteins. Average significance (sgnf_pct) calculated as (sgnf_pep / all_pep) * 100.

# Heatmap of average sgnf_pct by coverage category vs number of proteoforms (n_forms)
heatmap_data = protein_dpf_analysis.pivot_table(
    values='sgnf_pct',
    index='cov_cat',
    columns='n_forms',
    aggfunc='mean',
    observed=True,
)

# Ensure sensible ordering of coverage categories if categorical
if isinstance(protein_dpf_analysis['cov_cat'].dtype, pd.CategoricalDtype):
    heatmap_data = heatmap_data.reindex(protein_dpf_analysis['cov_cat'].cat.categories)

# Sort columns (n_forms) ascending
heatmap_data = heatmap_data.reindex(sorted(heatmap_data.columns), axis=1)

fig = plt.figure(figsize=(10, 6))
sns.heatmap(
    heatmap_data,
    annot=True,
    fmt=".1f",
    cmap="YlGnBu",
    cbar_kws={'label': 'Average sgnf_pct (%)'},
    linewidths=0.5,
    linecolor='white',
    mask=heatmap_data.isna()
)
plt.title('Average Significance Percentage (sgnf_pct) by Coverage Category vs Number of Proteoforms')
plt.ylabel('Coverage Category')
plt.xlabel('Number of Proteoforms (n_forms)')
plt.tight_layout()
plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename='heatmap_sgnf_pct_cov_cat_vs_n_forms',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

Unsurprisingly, proteins with higher coverage tend to have a greater proportion of significant peptides. Surprisingly, though, the highest dPF count (4) is detected only in proteins with Low and Medium coverage. This likely indicates either that high-coverage proteins are dominated by canonical and PTM peptides rather than diverse proteoforms, or that High and V.High coverage proteins are shorter and thus yield fewer peptides, making them easier to cover fully with fewer proteoforms.


02.2 Peptide Composition by Coverage

Stacked area plot showing the distribution of peptide types (canonical, PTM, and proteoform) across proteins sorted by coverage. The smoothed coverage trend line reveals how peptide composition varies with detection depth.

# Plot Composition of Proteins by dPF Categories
# This plot shows the distribution of canonical, PTM, and proteoform peptides across proteins
# with a smoothed coverage trend line.
custom_colors = {
    'Canonical': "#e5e5e5",      # Gray - baseline
    'PTM': "#a7c957",            # Red - modifications  
    'Proteoform': "#bc4749",     # Blue - true proteoforms
}

plot_data = protein_dpf_analysis.copy()
proteins_sorted = plot_data.sort_values('Coverage').reset_index(drop=True)
n_proteins = len(proteins_sorted)

# If many proteins, aggregate into bins (keeps trend readable)
# tune: lower -> more smoothing, higher -> more detail
max_bins = 100 
if n_proteins > max_bins:
    # group into evenly spaced bins
    proteins_sorted['bin'] = pd.cut(np.arange(n_proteins), bins=max_bins, labels=False)
    agg = proteins_sorted.groupby('bin').agg({
        'canon_pct':'mean',
        'ptm_pct':'mean',
        'dPF_pct':'mean',
        'Coverage':'mean'
    }).reset_index(drop=True)
    x = np.arange(len(agg))
    canonical = agg['canon_pct'].values
    ptm = agg['ptm_pct'].values
    proteoform = agg['dPF_pct'].values
    coverage = agg['Coverage'].values
    xtick_locs = np.linspace(0, len(agg)-1, 5).astype(int)
    xtick_labels = [f"{int(i/len(agg)*100)}%" for i in xtick_locs]  # percent position
else:
    x = np.arange(n_proteins)
    canonical = proteins_sorted['canon_pct'].values
    ptm = proteins_sorted['ptm_pct'].values
    proteoform = proteins_sorted['dPF_pct'].values
    coverage = proteins_sorted['Coverage'].values
    xtick_locs = np.linspace(0, n_proteins-1, 5).astype(int)
    xtick_labels = [f"{int(i/ n_proteins * 100)}%" for i in xtick_locs]

# Compute overall means for legend
mean_c = canonical.mean()
mean_ptm = ptm.mean()
mean_pf = proteoform.mean()

fig, ax = plt.subplots(figsize=(12, 5))
# stackplot handles the stacking cleanly
ax.stackplot(
    x,
    canonical, ptm, proteoform,
    labels=[f"Canonical ({mean_c:.1f}%)",
            f"PTMs ({mean_ptm:.1f}%)",
            f"Proteoforms ({mean_pf:.1f}%)"],
    colors=[custom_colors['Canonical'], custom_colors['PTM'], custom_colors['Proteoform']],
    alpha=0.85
)

# Coverage trend on twin axis (smoothed)
ax2 = ax.twinx()
window = max(3, int(len(coverage) * 0.02))  # ~2% window, min 3
coverage_smooth = pd.Series(coverage).rolling(window=window, center=True, min_periods=1).mean().values
ax2.plot(x, coverage_smooth, color='k', linewidth=1.6, alpha=0.9, label='Coverage (smoothed)')
ax2.set_ylim(0, 100)
ax.set_xlim(x.min(), x.max())

# Tidy up ticks/labels
ax.set_xlabel(f'{plot_data.shape[0]} Proteins (sorted by coverage)', fontsize=12, fontweight='bold')
ax.set_ylabel('Peptide composition (%)', fontsize=12, fontweight='bold')
ax2.set_ylabel('Coverage (%)', fontsize=12, fontweight='bold')
ax.set_xticks(xtick_locs)
ax.set_xticklabels(xtick_labels)
ax.set_ylim(0, max( (canonical + ptm + proteoform).max(), 100))  # sensible y-limits

# Legend and annotation
leg = ax.legend(
    loc='lower right', fontsize=10, frameon=False, bbox_to_anchor=(1.0, 0.1), 
    title='Peptide Composition', title_fontsize='12', ncol=1
)
ax2.legend(loc='lower right', fontsize=10, frameon=False, bbox_to_anchor=(1.0, 0))

# Add light grid and reduce spines clutter
ax.grid(axis='y', linestyle='--', linewidth=0.6, alpha=0.75, color='gray')

plt.tight_layout()
plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename='peptideComposition_proteinCoverage',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

Most peptides across all proteins are canonical, with PTM and proteoform peptides comprising smaller fractions. However, as protein coverage increases, the relative proportion of canonical peptides goes down slightly, while the proteoform contribution increases. Comparing PTM and proteoform peptides, lower-coverage proteins tend to have a higher fraction of PTM peptides, while at higher coverage levels proteoforms make up a larger share than PTMs.


02.3 Proteoform Discovery Landscape

Bubble plot visualizing the relationship between peptide count and proteoform discovery per protein. Bubble size represents protein length and color indicates coverage, revealing patterns in how protein characteristics influence proteoform detection.

# Plot the Efficiency of Proteoform Discovery
# X: Number of Peptides per Protein
# Y: Number of Proteoforms Discovered (n_forms)
# Size: Protein Length (bubble size)
# Color: Coverage
# Alpha: Coverage

from matplotlib.lines import Line2D

plot_data = protein_dpf_analysis.copy()
# plot_data['discovery_efficiency'] = plot_data['n_dPF'] / plot_data['all_pep'] # dPF_pct
# plot_data['significance_efficiency'] = plot_data['n_sgnf'] / plot_data['all_pep'] # sgnf_pct
plot_data['complexity_score'] = (plot_data['n_forms'] * plot_data['all_clust']) / (plot_data['Length'] / 100)
plot_data['complexity_score'] = plot_data['complexity_score'].fillna(0)

# Bubble size scaling (marker area)
length = plot_data['Length'].clip(lower=1)
min_area, max_area = 40, 1600
areas = np.interp(length, (length.min(), length.max()), (min_area, max_area))

# Coverage -> color and per-point alpha
coverage = plot_data['Coverage'].fillna(0).astype(float)
# cov_min, cov_max = coverage.min(), coverage.max()
cov_min, cov_max = 0, 100  # assuming coverage is in percentage
if cov_max == cov_min:
    cov_max = cov_min + 1e-6  # avoid zero-range

# map coverage to alpha range (low coverage -> lower alpha)
alpha_min, alpha_max = 0.30, 0.95
alphas = np.interp(coverage, (cov_min, cov_max), (alpha_min, alpha_max))

# build RGBA colors from a cmap, then set per-point alpha
cmap = plt.cm.RdYlBu_r
norm = plt.Normalize(vmin=cov_min, vmax=cov_max)
base_colors = cmap(norm(coverage))  # Nx4 array
base_colors[:, 3] = alphas         # set alpha channel per-point

# Plot
x = plot_data['all_pep'].values
y = plot_data['n_forms'].values

fig, ax = plt.subplots(figsize=(10, 6))

sc = ax.scatter(
    x, y,
    s=areas,
    c=base_colors,
    edgecolors='white',
    linewidth=0.8,
    zorder=2
)

# Colorbar for Coverage (note: alpha not reflected in bar)
mappable = plt.cm.ScalarMappable(norm=norm, cmap=cmap)
mappable.set_array(coverage)
cbar = fig.colorbar(mappable, ax=ax, pad=0.02, shrink=0.9)
cbar.set_label('Coverage (%)', fontsize=11, fontweight='bold')

# Size legend (approximate protein lengths)
lens = np.percentile(length, [10, 50, 90])
proxy_areas = np.interp(lens, (length.min(), length.max()), (min_area, max_area))
legend_proxies = [
    Line2D([0], [0], marker='o', color='w', label=f'Len ≈ {int(lens[0])}', markerfacecolor='gray', markersize=np.sqrt(proxy_areas[0])),
    Line2D([0], [0], marker='o', color='w', label=f'Len ≈ {int(lens[1])}', markerfacecolor='gray', markersize=np.sqrt(proxy_areas[1])),
    Line2D([0], [0], marker='o', color='w', label=f'Len ≈ {int(lens[2])}', markerfacecolor='gray', markersize=np.sqrt(proxy_areas[2]))
]

# Legends: exceptional + size proxies
main_handles, main_labels = ax.get_legend_handles_labels()
# ax.legend(loc='upper left', fontsize=10)
ax2 = ax.legend(handles=legend_proxies, title='Protein length (approx)', loc='upper right', fontsize=9, title_fontsize=10, frameon=False)
ax.add_artist(ax2)

# Labels and title
ax.set_xlabel('Number of Peptides per Protein', fontsize=12, fontweight='bold')
ax.set_ylabel('Number of Proteoforms Discovered', fontsize=12, fontweight='bold')
ax.set_title(
    'Proteoform Discovery Landscape\n(Size ~ Protein length, Color = Coverage; alpha ~ Coverage)',
    fontsize=14, fontweight='bold'
)
ax.grid(True, which="both", linestyle="--", linewidth=0.5, alpha=0.7)

plt.tight_layout()
plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename='proteoform_discovery_efficiency',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

Similar to the previous visualization (heatmap), this shows the number of proteoforms discovered tends to increase with peptide count, but with diminishing returns at higher counts. Larger proteins (bigger bubbles) and those with higher coverage (warmer colors) generally yield more proteoforms, indicating that both protein size and detection depth facilitate proteoform discovery.
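The "diminishing returns" pattern can be made concrete by comparing the per-peptide proteoform yield (`n_forms / all_pep`) across peptide-count bins. This is an illustrative sketch on synthetic data with a deliberately saturating relationship, not the notebook's actual values:

```python
# Illustrative sketch (synthetic data): with a saturating relationship,
# the per-peptide yield n_forms / all_pep falls as peptide count grows.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
all_pep = rng.integers(2, 60, 500)
# Saturating yield: proteoform count grows roughly with sqrt of peptide count
n_forms = np.floor(np.sqrt(all_pep) + rng.normal(0, 0.3, 500)).clip(min=0)

demo = pd.DataFrame({'all_pep': all_pep, 'n_forms': n_forms})
demo['yield'] = demo['n_forms'] / demo['all_pep']
by_bin = demo.groupby(
    pd.cut(demo['all_pep'], [0, 10, 20, 40, 60]), observed=True
)['yield'].mean()
print(by_bin)
```

A monotone decline of this yield across bins on the real data would confirm that additional peptides help, but with shrinking marginal returns.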


02.4 Discovery Efficiency by Cluster Count

Combined visualization showing proteoform category composition (stacked bars) and efficiency metrics (line plots) across cluster counts. This reveals how clustering behavior relates to proteoform discovery rates.

# Revised layout: stacked bar (top) + percentage lines + efficiency (bottom) with shared x-axis
import matplotlib.patches as mpatches

plot_data = protein_dpf_analysis.copy()

# 1. Prepare data (reuse same ordering)
composition_data = pd.crosstab(plot_data['all_clust'], plot_data['dpf_cat'])
category_order = ['dPF Only', 'PTM Only', 'No Forms', 'Both']
composition_data = composition_data.reindex(columns=category_order, fill_value=0).sort_index()
totals = composition_data.sum(axis=1).astype(int)
x_coords = np.arange(len(composition_data))

# colors in same order as columns
col_color_map = {"No Forms":"#e5e5e5","PTM Only":"#a7c957","dPF Only":"#bc4749","Both":"#967aa1"}
bar_colors = [col_color_map[c] for c in composition_data.columns]

# 2. Create stacked bar (top) and line/CI plot (bottom) sharing X
fig, (ax_bar, ax_line) = plt.subplots(2, 1, figsize=(12, 8), sharex=True, gridspec_kw={'height_ratios':[1, 1]})

# Top: stacked bars
composition_data.plot(
    kind='bar',
    stacked=True,
    ax=ax_bar,
    color=bar_colors,
    width=0.8,
    legend=False
)
ax_bar.set_ylabel('Number of Proteins', fontsize=12)
ax_bar.set_title('Composition of Proteoform Discoveries by Cluster Count', fontsize=16, pad=12)
ax_bar.grid(True, which="both", linestyle="--", linewidth=0.5, alpha=0.7)
ax_bar.set_ylim(0, int(totals.max() * 1.12))

# totals above bars
y_offset = max(2, totals.max() * 0.02)
for i, total in enumerate(totals.values):
    if total > 0:
        ax_bar.text(i, total + y_offset, f"{total:,}", ha='center', va='bottom', fontsize=10, fontweight='bold')

# tidy x tick labels on the bottom axis
ax_line.set_xlabel('Number of Clusters (all_clust)', fontsize=12)
ax_line.set_xticks(x_coords)
ax_line.set_xticklabels([str(x) for x in composition_data.index], rotation=0)

# 3. Bottom: percentage lines per category with 95% CI (binomial) + efficiency mean with CI (t-approx)
# Percentage per category (p) and 95% CI using binomial approximation: CI = 1.96 * sqrt(p*(1-p)/N)
for col, color in zip(composition_data.columns, bar_colors):
    counts = composition_data[col].values.astype(float)
    N = totals.values.astype(float)
    # avoid division by zero
    p = np.divide(counts, N, out=np.zeros_like(counts, dtype=float), where=N!=0)
    sem_binom = np.sqrt(np.divide(p * (1 - p), np.where(N == 0, 1, N)))
    ci95 = sem_binom * 1.96
    ax_line.errorbar(
        x_coords, p * 100,
        yerr=ci95 * 100,
        marker='o', linewidth=4, color=color, capsize=8, label=f"{col} (%)", 
        linestyle='-', markersize=8, alpha=0.8, elinewidth=2,
        markeredgecolor='black', capthick=2,
    )

# Efficiency stats by cluster: mean and 95% CI from group std / sqrt(n)
eff_stats = (
    plot_data.reset_index()
             .groupby('all_clust')['eff_pct']
             .agg(['mean', 'std', 'count'])
             .reindex(composition_data.index)
)
eff_mean = eff_stats['mean'].fillna(0)
eff_std = eff_stats['std'].fillna(0)
eff_count = eff_stats['count'].fillna(0).replace(0, np.nan)  # avoid div-by-zero
eff_sem = eff_std / np.sqrt(eff_count)
eff_ci95 = (eff_sem * 1.96).fillna(0)

# Plot efficiency mean and CI as emphasized black line + shaded band
ax_line.plot(x_coords, eff_mean.values, marker='s', color='k', linewidth=2.5, markersize=7, label='Avg Efficiency (eff_pct)')
ax_line.fill_between(x_coords, (eff_mean - eff_ci95).values, (eff_mean + eff_ci95).values, color='k', alpha=0.12)

# 4. Styling & legends
ax_line.set_ylabel('Percentage / Efficiency (%)', fontsize=12)
ax_line.set_ylim(0, 100)
ax_line.grid(True, which="both", linestyle="--", linewidth=0.5, alpha=0.7)

# Bar legend (from patches) and line legend
bar_handles = [mpatches.Patch(color=col_color_map[c], label=c) for c in composition_data.columns]
leg1 = ax_bar.legend(handles=bar_handles, title='Proteoform Category', bbox_to_anchor=(1.02, 1), loc='upper left', frameon=False)

leg2 = ax_line.legend(loc='upper center', bbox_to_anchor=(0.5, -0.18), ncol=5, frameon=False, title='Percent lines / Efficiency')

# keep bar legend visible
ax_bar.add_artist(leg1)

plt.tight_layout()
plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename='proteoform_discovery_efficiency_by_cluster',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

This plot looks directly at the number of clusters rather than just dPFs or PTMs. Overall, the cluster count has little effect on proteoform discovery efficiency or on the category composition. As the number of clusters increases, both the dPF and PTM percentages decrease; the exception is at 9 clusters (a single protein), where PTM is the only category observed.

Note: This plot is not the most informative, but it does offer some insight into how clustering relates to proteoform discovery. More clusters do not necessarily lead to higher proteoform detection, suggesting that cluster quality and peptide characteristics play larger roles than cluster count.

Note: Another takeaway is that clustering, rather than proteoform detection itself, appears to be the main bottleneck for proteoform discovery. As the number of clusters increases, both discovery efficiency and category percentages decrease, indicating that either the clustering is suboptimal or the peptides themselves are not informative enough to distinguish proteoforms.
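The claimed negative relationship between cluster count and discovery efficiency could be checked with a rank correlation. A hedged sketch, using the `all_clust` and `eff_pct` column names from `protein_dpf_analysis` but synthetic values:

```python
# Hedged sketch of the bottleneck check: rank correlation between cluster
# count and discovery efficiency. Column semantics follow the notebook's
# protein_dpf_analysis table; the values here are simulated.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
all_clust = rng.integers(1, 10, 300)
# Simulate an efficiency that declines with cluster count, plus noise
eff_pct = np.clip(80 - 6 * all_clust + rng.normal(0, 10, 300), 0, 100)

rho, pval = spearmanr(all_clust, eff_pct)
print(f"Spearman rho = {rho:.2f} (p = {pval:.2g})")
```

A significantly negative rho on the real data would support the "clustering is the bottleneck" reading; Spearman is preferred here since `all_clust` is a small discrete count.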


02.5 Significant Peptide Distribution

Donut chart showing the distribution of proteins by their number of statistically significant peptides from the linear model: none, single, or multiple significant peptides per protein.

# Plot donut chart for proteins with no signf, single signf, multiple signf
plot_data = protein_dpf_analysis[[ 'Gene', 'sgnf_pep' ]].copy()

# Categorize proteins
labels = ['None', 'Single', 'Multiple']
plot_data['sgnf_cat'] = pd.cut(
    plot_data['sgnf_pep'],
    bins=[-1, 0, 1, np.inf],
    labels=labels,
    right=True
)

# Ordered counts and percentages
counts = plot_data['sgnf_cat'].value_counts().reindex(labels, fill_value=0)
total = counts.sum()
percent = counts / total

# slightly separate very small slices
explode = [0.02 if p < 0.05 else 0.0 for p in percent]

# Use a list of color strings (in the same order as `labels`) instead of a dict.
# colors_list = ['#c3c3c3', '#fca311', '#e54f2a'] # Autumn palette
colors_list = ["#e5e5e5", "#669bbc", "#003049"] # Blue-Gray palette

fig, ax = plt.subplots(figsize=(8, 6))
wedges, texts = ax.pie(
    counts.values,
    labels=None,             
    startangle=140,
    colors=colors_list,
    wedgeprops=dict(width=0.40, edgecolor='w'),
    explode=explode,
    counterclock=False
)

# Add percent labels placed on wedges
bbox_props = dict(boxstyle="round,pad=0.3", fc="white", ec="none", alpha=0.8)
for w, p in zip(wedges, percent):
    angle = (w.theta2 + w.theta1) / 2.0
    x, y = w.r * 0.75 * np.cos(np.deg2rad(angle)), w.r * 0.75 * np.sin(np.deg2rad(angle))
    ax.text(x, y, f"{p:.1%}", ha='center', va='center', fontsize=15, bbox=bbox_props, fontweight='bold')

# Center annotation: total proteins
ax.text(0, 0, f"Total\n{total}", ha='center', va='center', fontsize=15, fontweight='bold')

# Legend with counts and percentages
legend_labels = [f"{lab}: {cnt:,} ({pct:.1%})" for lab, cnt, pct in zip(labels, counts, percent)]
ax.legend(
    wedges, legend_labels,
    title="Number of Significant Peptides",
    loc="upper center",
    bbox_to_anchor=(0.5, -0.12),
    ncol=len(labels),
    fontsize=11,
    title_fontsize=12,
    frameon=False, 
    handletextpad=0.5,
    columnspacing=1.5,
)

ax.axis('equal')
plt.title('Distribution of Proteins by Number of \nSignificant Peptides from Linear Model', fontsize=16, fontweight='bold')
plt.tight_layout()
plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename='significantPeptides_perProtein_Donut',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

This figure shows the proportion of proteins by the number of significant peptides detected, splitting single and multiple significant peptides into separate categories. About 31% of proteins have no significant peptides at all; the majority have at least one.


02.6 Proteoform Category Distribution

Donut chart showing the overall distribution of proteins across proteoform categories: no forms detected, PTM-only, dPF-only, or both PTMs and dPFs identified.

# Plot donut chart for proteins with no additional forms, only dPFs, only PTMs, with Both
plot_data = protein_dpf_analysis[[ 'Gene', 'n_forms', 'n_canon', 'n_ptm', 'n_dPF' ]].copy()

without_forms = set(plot_data[(plot_data['n_forms'] == 0)].index)
with_PTM_only = set(plot_data[(plot_data['n_forms'] == 0) & (plot_data['n_ptm'] > 0)].index)
with_dPF_only = set(plot_data[(plot_data['n_forms'] > 0) & (plot_data['n_ptm'] == 0)].index)
with_both = set(plot_data[(plot_data['n_forms'] > 0) & (plot_data['n_ptm'] > 0)].index)
without_forms = without_forms.difference(with_PTM_only)
print(f"Proteins with no additional forms: {len(without_forms):,}")
print(f"Proteins with only PTMs: {len(with_PTM_only):,}")
print(f"Proteins with only dPFs: {len(with_dPF_only):,}")
print(f"Proteins with both PTMs and dPFs: {len(with_both):,}")

# Categorize proteins
labels = ['No Forms', 'PTM Only', 'dPF Only', 'Both']
counts = [
    len(without_forms),
    len(with_PTM_only),
    len(with_dPF_only),
    len(with_both)
]
# Calculate percentages
percent = np.array(counts) / sum(counts)

# slightly separate very small slices
explode = [0.02 if p < 0.05 else 0.0 for p in percent]
# Use a list of color strings (in the same order as `labels`) instead of a dict.
colors_list = ["#e5e5e5", "#a7c957", "#bc4749", "#967aa1"]

fig, ax = plt.subplots(figsize=(8, 6))
wedges, texts = ax.pie(
    counts,
    labels=None,             
    startangle=140,
    colors=colors_list,
    wedgeprops=dict(width=0.40, edgecolor='w'),
    explode=explode,
    counterclock=False
)
# Add percent labels placed on wedges
bbox_props = dict(boxstyle="round,pad=0.3", fc="white", ec="none", alpha=0.8)
for w, p in zip(wedges, percent):
    angle = (w.theta2 + w.theta1) / 2.0
    x, y = w.r * 0.75 * np.cos(np.deg2rad(angle)), w.r * 0.75 * np.sin(np.deg2rad(angle))
    ax.text(x, y, f"{p:.1%}", ha='center', va='center', fontsize=15, bbox=bbox_props, fontweight='bold')

# Center annotation: total proteins
ax.text(0, 0, f"Total\n{sum(counts)}", ha='center', va='center', fontsize=15, fontweight='bold')

# Legend with counts and percentages
legend_labels = [f"{lab}: {cnt:,} ({pct:.1%})" for lab, cnt, pct in zip(labels, counts, percent)]
ax.legend(
    wedges, legend_labels,
    title="Number of Proteins",
    loc="upper center",
    bbox_to_anchor=(0.5, -0.12),
    ncol=len(labels),
    fontsize=11,
    title_fontsize=12,
    frameon=False, 
    handletextpad=0.5,
    columnspacing=1.5,
)
ax.axis('equal')
plt.title('Distribution of Proteins by Additional Forms and PTMs', fontsize=16, fontweight='bold')
plt.tight_layout()
plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename='formCategories_perProtein_Donut',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)
Proteins with no additional forms: 2,206
Proteins with only PTMs: 3,765
Proteins with only dPFs: 739
Proteins with both PTMs and dPFs: 451

Moving on to the categorization of the significant peptides, each can end up in one of two places. A significant peptide that does not cluster with any other peptide becomes a PTM, while one that clusters with other peptides (significant or not) becomes part of a dPF. This chart shows the overall distribution of proteins across these categories, with an added "Both" category for proteins that have both PTM-only peptides and peptides assigned to dPFs. It is not surprising that most significant peptides are PTM-only.
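The two-way split described above follows the `dPF` encoding used throughout this notebook (assuming, as in the later filtering step, that `dPF == -1` marks an unclustered significant peptide, `dPF > 0` a dPF-cluster member, and `dPF == 0` canonical behavior). A minimal sketch on invented data:

```python
# Minimal sketch of the peptide categorization, assuming the dPF encoding
# used later in this notebook: -1 = PTM candidate (did not cluster),
# >0 = member of a dPF cluster, 0 = canonical. Data is synthetic.
import pandas as pd

peptides = pd.DataFrame({
    'PeptideID': [1, 2, 3, 4, 5],
    'dPF':       [0, -1, 1, 1, 0],
})

def categorize(dpf: int) -> str:
    if dpf == -1:
        return 'PTM'
    return 'dPF member' if dpf > 0 else 'Canonical'

peptides['category'] = peptides['dPF'].map(categorize)
print(peptides['category'].value_counts())
```

Aggregating these per-peptide labels to the protein level (any PTM plus any dPF member) is what yields the "Both" slice of the donut.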


03. Peptide-UniProt Annotation Mapping

Now that we have an idea about the discovery of peptides as PTMs and/or dPFs, we can move on to mapping these peptides to UniProt annotations to understand their functional implications. This is the most crucial step to connect the peptide-level findings to biological context.

03.1 Build Enhanced Annotation Map

Create a comprehensive mapping between identified peptides and UniProt annotations, including miscleavage detection. This connects peptide-level ProteoForge results with functional annotation data for biological interpretation.

def create_enhanced_peptide_uniprot_map_with_miscleavage(
    summary_data: pd.DataFrame,
    uniprot_data: pd.DataFrame,
    protein_dpf_analysis: pd.DataFrame,
    overlap_threshold: float = 0.5,
    verbose: bool = False
) -> pd.DataFrame:
    """
    Create an enhanced peptide-UniProt annotation mapping with miscleavage detection.

    Parameters
    ----------
    summary_data : pd.DataFrame
        Peptide summary data with columns: Protein, PeptideID, peptide_start, peptide_end, dPF.
    uniprot_data : pd.DataFrame
        UniProt annotation data with columns: Protein, feature_category, feature, start, end, group, agent, note, score, is_relevant.
    protein_dpf_analysis : pd.DataFrame
        Protein dPF analysis with at least columns: Protein, dpf_cat.
    overlap_threshold : float, optional
        Fractional overlap threshold for miscleavage detection (default is 0.5).
    verbose : bool, optional
        If True, print detailed progress and statistics.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns:
        Protein, PeptideID, pepStart, pepEnd, startpos, endpos, feature_category, feature,
        group, agent, note, score, dPF.
    """
    import time
    t0 = time.time()
    if verbose:
        print("=== Enhanced Peptide-UniProt Mapping with Miscleavage Detection ===")

    # Filter proteins of interest
    proteins = protein_dpf_analysis.loc[
        protein_dpf_analysis['dpf_cat'] != 'No Forms', 'Protein'
    ].unique()
    if verbose:
        print(f"  Proteins to check: {len(proteins):,}")

    # Prepare peptides
    peptides = (
        summary_data[summary_data['Protein'].isin(proteins)]
        .rename(columns={'peptide_start': 'pepStart', 'peptide_end': 'pepEnd'})
        [['Protein', 'PeptideID', 'pepStart', 'pepEnd', 'dPF']]
        .reset_index(drop=True)
    )
    if verbose:
        print(f"  Peptides to process: {len(peptides):,}")

    # Prepare relevant UniProt annotations
    annotations = (
        uniprot_data[
            (uniprot_data['is_relevant']) &
            (uniprot_data['Protein'].isin(proteins))
        ]
        .rename(columns={'start': 'ann_start', 'end': 'ann_end'})
        [['Protein', 'feature_category', 'feature', 'ann_start', 'ann_end', 'group', 'agent', 'note', 'score']]
        .reset_index(drop=True)
    )
    if verbose:
        print(f"  Relevant UniProt annotations: {len(annotations):,}")

    # Find peptide-annotation overlaps
    merged = peptides.merge(annotations, on='Protein', how='left')
    overlap_mask = (
        ((merged['ann_start'] >= merged['pepStart']) & (merged['ann_start'] <= merged['pepEnd'])) |
        ((merged['ann_end'] >= merged['pepStart']) & (merged['ann_end'] <= merged['pepEnd'])) |
        ((merged['ann_start'] <= merged['pepStart']) & (merged['ann_end'] >= merged['pepEnd']))
    )
    ann_overlap = merged[overlap_mask].copy()

    # Build annotation result
    result_ann = pd.DataFrame()
    if not ann_overlap.empty:
        result_ann = pd.DataFrame({
            'Protein': ann_overlap['Protein'],
            'PeptideID': ann_overlap['PeptideID'],
            'pepStart': ann_overlap['pepStart'],
            'pepEnd': ann_overlap['pepEnd'],
            'startpos': ann_overlap['ann_start'],
            'endpos': ann_overlap['ann_end'],
            'feature_category': ann_overlap['feature_category'],
            'feature': ann_overlap['feature'],
            'group': ann_overlap['group'],
            'agent': ann_overlap['agent'],
            'note': ann_overlap['note'],
            'score': ann_overlap['score'],
            'dPF': ann_overlap['dPF']
        })

    # Miscleavage/peptide-peptide overlap detection
    if verbose:
        print("  Detecting miscleavage/peptide-peptide overlaps...")

    pep2pep = peptides.merge(peptides, on='Protein', suffixes=('', '_other'))
    pep2pep = pep2pep[pep2pep['PeptideID'] != pep2pep['PeptideID_other']]
    pep2pep['pep_length'] = pep2pep['pepEnd'] - pep2pep['pepStart'] + 1
    pep2pep['overlap_start'] = np.maximum(pep2pep['pepStart'], pep2pep['pepStart_other'])
    pep2pep['overlap_end'] = np.minimum(pep2pep['pepEnd'], pep2pep['pepEnd_other'])
    actual_overlap = pep2pep[pep2pep['overlap_start'] <= pep2pep['overlap_end']].copy()
    actual_overlap['overlap_length'] = actual_overlap['overlap_end'] - actual_overlap['overlap_start'] + 1
    actual_overlap['overlap_fraction'] = actual_overlap['overlap_length'] / actual_overlap['pep_length']
    sig_overlap = actual_overlap[actual_overlap['overlap_fraction'] > overlap_threshold].copy()
    sig_overlap['overlap_type'] = np.where(sig_overlap['overlap_fraction'] > 0.8, 'Miscleavage', 'Partial_Overlap')

    result_mis = pd.DataFrame()
    if not sig_overlap.empty:
        result_mis = pd.DataFrame({
            'Protein': sig_overlap['Protein'],
            'PeptideID': sig_overlap['PeptideID'],
            'pepStart': sig_overlap['pepStart'],
            'pepEnd': sig_overlap['pepEnd'],
            'startpos': sig_overlap['overlap_start'],
            'endpos': sig_overlap['overlap_end'],
            'feature_category': 'Cleavage Analysis',
            'feature': 'MISCLEAVAGE',
            'group': sig_overlap['overlap_type'],
            'agent': 'ProteoForge_Analysis',
            'note': 'Overlap with PeptideID ' + sig_overlap['PeptideID_other'].astype(str) +
                    ' (fraction: ' + sig_overlap['overlap_fraction'].round(3).astype(str) + ')',
            'score': sig_overlap['overlap_fraction'],
            'dPF': sig_overlap['dPF']
        })

    # Peptides with no annotation or miscleavage
    all_pep_ids = set(zip(peptides['Protein'], peptides['PeptideID']))
    annotated_ids = set()
    if not result_ann.empty:
        annotated_ids.update(zip(result_ann['Protein'], result_ann['PeptideID']))
    if not result_mis.empty:
        annotated_ids.update(zip(result_mis['Protein'], result_mis['PeptideID']))
    no_annot_ids = all_pep_ids - annotated_ids

    result_none = pd.DataFrame()
    if no_annot_ids:
        no_annot_df = peptides[
            peptides.set_index(['Protein', 'PeptideID']).index.isin(no_annot_ids)
        ].copy()
        result_none = pd.DataFrame({
            'Protein': no_annot_df['Protein'],
            'PeptideID': no_annot_df['PeptideID'],
            'pepStart': no_annot_df['pepStart'],
            'pepEnd': no_annot_df['pepEnd'],
            'startpos': -1,
            'endpos': -1,
            'feature_category': '',
            'feature': 'No Annotation',
            'group': '',
            'agent': '',
            'note': '',
            'score': 0.0,
            'dPF': no_annot_df['dPF']
        })

    # Combine all results
    result = pd.concat([df for df in [result_ann, result_mis, result_none] if not df.empty], ignore_index=True)

    # Standardize dtypes
    dtype_map = {
        'Protein': 'string',
        'PeptideID': 'int32',
        'pepStart': 'int32',
        'pepEnd': 'int32',
        'startpos': 'int32',
        'endpos': 'int32',
        'feature_category': 'string',
        'feature': 'string',
        'group': 'string',
        'agent': 'string',
        'note': 'string',
        'score': 'float32',
        'dPF': 'int8'
    }
    for col, typ in dtype_map.items():
        if col in result.columns:
            result[col] = result[col].astype(typ)
    string_cols = ['feature_category', 'group', 'agent', 'note']
    result[string_cols] = result[string_cols].fillna('')

    # Verbose summary
    if verbose:
        t1 = time.time()
        miscleavage_count = (result['feature'] == 'MISCLEAVAGE').sum()
        annotation_count = (result['feature'] != 'MISCLEAVAGE').sum()
        no_annotation_count = (result['feature'] == 'No Annotation').sum()
        print(f"\n=== Mapping Summary ===")
        print(f"  Total entries: {len(result):,}")
        print(f"    - UniProt annotations: {annotation_count - no_annotation_count:,}")
        print(f"    - Miscleavage features: {miscleavage_count:,}")
        print(f"    - No annotation: {no_annotation_count:,}")
        print(f"  Processing time: {t1-t0:.2f} seconds")

    return result[[
        'Protein', 'PeptideID', 'pepStart', 'pepEnd', 
        'feature_category', 'feature', 'startpos', 'endpos', 
        'group', 'agent', 'note', 'score', 'dPF'
    ]].sort_values(['Protein', 'PeptideID', 'feature', 'startpos']).reset_index(drop=True)

peptide_uniprot_map = create_enhanced_peptide_uniprot_map_with_miscleavage(
    summary_data, uniprot_data, protein_dpf_analysis, 
    overlap_threshold=0.5, verbose=True
)
print()
print("Sample of Peptide-UniProt Mapping with Miscleavage Detection:")
peptide_uniprot_map.head(10)
=== Enhanced Peptide-UniProt Mapping with Miscleavage Detection ===
  Proteins to check: 4,955
  Peptides to process: 76,834
  Relevant UniProt annotations: 431,860
  Detecting miscleavage/peptide-peptide overlaps...

=== Mapping Summary ===
  Total entries: 240,608
    - UniProt annotations: 213,542
    - Miscleavage features: 9,595
    - No annotation: 17,471
  Processing time: 2.13 seconds

Sample of Peptide-UniProt Mapping with Miscleavage Detection:
Protein PeptideID pepStart pepEnd feature_category feature startpos endpos group agent note score dPF
0 A0A1B0GTU1 1 89 99 No Annotation -1 -1 0.0000 0
1 A0A1B0GTU1 2 115 124 No Annotation -1 -1 0.0000 0
2 A0A1B0GTU1 3 179 193 No Annotation -1 -1 0.0000 -1
3 A0A1B0GTU1 4 714 724 No Annotation -1 -1 0.0000 0
4 A0A1B0GTU1 5 758 776 Co- & Post-Translational Modifications CROSSLNK 765 765 Ubiquitination/SUMOylation Ubiquitination/SUMOylation at K765 1.0000 0
5 A0A1W2PQL4 1 43 50 No Annotation -1 -1 0.0000 0
6 A0A1W2PQL4 2 234 242 No Annotation -1 -1 0.0000 0
7 A0A1W2PQL4 3 290 298 No Annotation -1 -1 0.0000 -1
8 A0A1W2PQL4 4 346 354 No Annotation -1 -1 0.0000 0
9 A0A1W2PQL4 5 373 382 No Annotation -1 -1 0.0000 0

About 250K annotation-peptide pairs were mapped to generate the full Peptide-UniProt annotation map. As seen above, a peptide without any annotation is still kept in the map, with empty values for feature_category, group, agent, etc. Additional information such as score, dPF, and position is also retained for further analysis.

This step is critical, and I wanted to fold as much information from as many angles/sources as possible into a single mapping table to facilitate downstream analyses linking peptide-level findings to functional annotations. However, as you will see in upcoming parts, some quantities are too complicated to calculate automatically, so I had to count them manually.
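A typical downstream query against this table is to pull all annotations overlapping the PTM-flagged peptides of a protein of interest. A sketch on a toy frame that mimics `peptide_uniprot_map`'s columns (the protein accession and values are invented):

```python
# Hypothetical downstream query against the mapping table: select all
# annotations overlapping PTM-flagged peptides (dPF == -1) of one protein.
# The toy frame mimics peptide_uniprot_map's columns; data is invented.
import pandas as pd

toy_map = pd.DataFrame({
    'Protein':   ['P00001'] * 4,
    'PeptideID': [1, 1, 2, 3],
    'feature':   ['MOD_RES', 'MISCLEAVAGE', 'No Annotation', 'MOD_RES'],
    'group':     ['Phosphorylation', 'Miscleavage', '', 'Acetylation'],
    'dPF':       [-1, -1, 0, 1],
})

ptm_hits = toy_map[
    (toy_map['dPF'] == -1) & (toy_map['feature'] != 'No Annotation')
]
print(ptm_hits[['PeptideID', 'feature', 'group']])
```

Because every peptide is retained even without annotations, such filters can distinguish "annotated PTM candidate" from "unexplained outlier" without re-joining the source tables.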


04. PTM and Proteoform Characterization

04.1 PTM Annotation Analysis

Analyze peptides classified as PTM-only (dPF = -1) by examining their overlap with UniProt annotations. This reveals which PTM types are most commonly associated with single-peptide discordance patterns.

# Filter for only PTM peptides (dPF == -1)
final_res = peptide_uniprot_map[
    (peptide_uniprot_map['dPF'] == -1)
]

# Number of Proteins with No Annotation
no_annot_proteins = final_res[final_res['feature'] == 'No Annotation']['Protein'].unique()
with_annotation_proteins = final_res[final_res['feature'] != 'No Annotation']['Protein'].unique()
print(f"Number of Proteins with No Annotation: {len(no_annot_proteins)}")
print(f"Number of Proteins with Annotation: {len(with_annotation_proteins)}")

# Create a donut chart
data = [len(no_annot_proteins), len(with_annotation_proteins)]
# Labels
labels = ["No Annotation", "With Annotation"]
# Colors
colors = ["#e5e5e5", "#a7c957"]

fig, ax = plt.subplots(figsize=(8, 6))
wedges, texts = ax.pie(
    data,
    labels=None,             
    startangle=140,
    colors=colors,
    wedgeprops=dict(width=0.60, edgecolor='w'),
    explode=[0.02, 0.0],  # separate the first slice slightly
    counterclock=False
)
# Add percent labels placed on wedges
bbox_props = dict(boxstyle="round,pad=0.3", fc="white", ec="none", alpha=0.8)
for w, p in zip(wedges, np.array(data) / sum(data)):
    angle = (w.theta2 + w.theta1) / 2.0
    x, y = w.r * 0.75 * np.cos(np.deg2rad(angle)), w.r * 0.75 * np.sin(np.deg2rad(angle))
    ax.text(x, y, f"{p:.1%}", ha='center', va='center', fontsize=15, bbox=bbox_props, fontweight='bold')
# Center annotation: total proteins
ax.text(0, 0, f"Total\n{sum(data)}", ha='center', va='center', fontsize=15, fontweight='bold')
# Legend with counts and percentages
legend_labels = [f"{lab}: {cnt:,} ({pct:.1%})" for lab, cnt, pct in zip(labels, data, np.array(data) / sum(data))]
ax.legend(
    wedges, legend_labels,
    title="Number of Proteins",
    loc="upper center",
    bbox_to_anchor=(0.5, -0.12),
    ncol=len(labels),
    fontsize=11,
    title_fontsize=12,
    frameon=False, 
    handletextpad=0.5,
    columnspacing=1.5,
)
ax.axis('equal')
plt.title('Distribution of Proteins with PTM Peptides\nAnnotation Availability', fontsize=16, fontweight='bold')
plt.tight_layout()
plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename='proteins_withPTM_peptides_UniprotAnnotation_Donut',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

print()

# Create a visualization to show the Number of Proteins with UniProt Annotations
fig, ax = plt.subplots(figsize=(5, 5))
sub_data = final_res[final_res['feature'] != 'No Annotation']

# Create a bar plot
sub_data['feature'].value_counts().plot(kind='barh', color="#a7c957", ax=ax, edgecolor='black')
# Add counts to on top of the bars
for i, v in enumerate(sub_data['feature'].value_counts()):
    ax.text(
        # v, i, str(v),
        # Adjust the position of the text
        v + 2.5, i-0.1, str(v),
        color='black', 
        ha='left', va='center', 
        rotation=0, fontsize=12, fontweight='bold'
    )
ax.set_xlabel("Number of Annotations", fontsize=12)
ax.set_ylabel("Uniprot Feature Type", fontsize=12)
ax.set_title("Annotations Matched to Single PTMs", fontsize=14, fontweight='bold')
ax.grid("both", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
plt.tight_layout()
plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename='proteins_withPTM_peptides_UniprotAnnotation_Bar',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)
print()
print("Modified Residues (MOD_RES) Annotations:")
print((sub_data.loc[sub_data['feature']=='MOD_RES', 'group'].str.split(';').str[0]).value_counts())
print()
print("Cleavage Annotations:")
print((sub_data.loc[sub_data['feature']=='CLEAVAGE', 'group'].str.split(';').str[0]).value_counts())
print()
Number of Proteins with No Annotation: 1239
Number of Proteins with Annotation: 3245







Modified Residues (MOD_RES) Annotations:
group
Phosphorylation            4737
Acetylation                 554
Methylation                 289
Acylation                    44
Hydroxylation                18
Nitration/Nitrosylation       6
ADP-ribosylation              4
Other Modification            4
Name: count, dtype: int64

Cleavage Annotations:
group
Proteolytic Cleavage    1395
Name: count, dtype: int64

There are 4,484 peptides classified as PTM-only (dPF = -1) across all proteins. Of these, 3,245 (approximately 72%) have at least one annotation within the peptide. This means a significant portion of PTM-only peptides can be linked to known features. However, ~28% of PTM-only peptides do not overlap with any known annotation, indicating potential novel modifications, limitations in current annotation databases, or simply peptides behaving like technical outliers without a biological basis.

Note: A single protein can contain multiple PTM peptides, which is why the 4,484 total does not match the 4,216 proteins in the PTM-only and Both categories combined.
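The peptide-versus-protein count discrepancy noted above is just a rows-versus-unique-keys distinction. A minimal sketch on invented data:

```python
# Sketch of why peptide counts exceed protein counts: several PTM-only
# peptides can map to the same protein, so counting rows and counting
# unique proteins give different totals. Data is synthetic.
import pandas as pd

ptm_peps = pd.DataFrame({
    'Protein':   ['P1', 'P1', 'P2', 'P3', 'P3', 'P3'],
    'PeptideID': [1, 2, 1, 1, 2, 3],
})

n_peptides = len(ptm_peps)                  # six PTM-only peptides...
n_proteins = ptm_peps['Protein'].nunique()  # ...across only three proteins
print(n_peptides, n_proteins)
```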

Of those 3,245 PTM-only peptides with annotations, quite a large number of annotations is observed; the counts can be seen in the second figure above. The largest category is MOD_RES, which makes sense as these are post-translational modifications. Other large categories such as CROSSLNK, VAR_SEQ, and CLEAVAGE also make sense in the context of PTMs.

Note: A single peptide can also host many annotations, which is why the annotation count greatly exceeds the peptide count.
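The one-peptide-to-many-annotations relationship can be illustrated with a toy version of the mapping table (column names mirror those used in this notebook; the values are made up for illustration):

```python
import pandas as pd

# Hypothetical mini version of the peptide-to-annotation mapping:
# one row per (peptide, overlapping UniProt feature) pair.
toy_map = pd.DataFrame({
    'Protein':   ['P1', 'P1', 'P1', 'P2'],
    'PeptideID': [3, 3, 7, 1],  # peptide 3 of P1 hosts two annotations
    'feature':   ['MOD_RES', 'CROSSLNK', 'MOD_RES', 'No Annotation'],
})

annotated = toy_map[toy_map['feature'] != 'No Annotation']
n_annotations = len(annotated)
n_peptides = annotated[['Protein', 'PeptideID']].drop_duplicates().shape[0]
print(n_annotations, n_peptides)  # 3 annotations spread over 2 unique peptides
```

Counting rows gives the annotation total, while deduplicating on (Protein, PeptideID) gives the peptide total, so the former can be much larger than the latter.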

Looking closely at the MOD_RES annotations, which are directly related to PTMs, we see a wide variety of modification types associated with these PTM-only peptides. The most common are phosphorylation events (Phosphoserine, Phosphothreonine, and Phosphotyrosine), well-known regulatory PTMs involved in signaling pathways. Their overwhelming representation here indicates that many of the PTM-only peptides may be phosphorylated forms of the canonical sequences.

I also checked the cleavage annotations: only proteolytic cleavages are observed (signal peptide, transit peptide, and propeptide cleavage). This indicates that some of these PTM-only peptides may arise from specific proteolytic processing events rather than chemical modifications.


04.2 Modification Type Distribution

Breakdown of specific modification types mapped to identified PTMs. The MOD_RES annotations reveal the chemical nature of post-translational modifications associated with discordant peptide behavior.

# Data preparation
sub_data = final_res[final_res['feature'] != 'No Annotation']
mod_res_data = sub_data[sub_data['feature'] == 'MOD_RES']['group'].str.split(';').str[0].value_counts()

# Calculate percentages
mod_total = mod_res_data.sum()
mod_percentages = (mod_res_data / mod_total * 100)

# Create figure
fig, ax = plt.subplots(figsize=(12, 4))

# Create stacked horizontal bar
left = 0
colors_mod = plt.cm.Set3(np.linspace(0, 1, len(mod_res_data)))

for i, (mod_type, count) in enumerate(mod_res_data.items()):
    width = mod_percentages.iloc[i]
    bar = ax.barh(0, width, left=left, color=colors_mod[i], 
                  edgecolor='white', linewidth=2, height=0.6)
    
    # Add percentage label if there's enough space (>3% threshold)
    if width > 3:
        ax.text(left + width/2, 0, f'{width:.1f}%\n({count:,})', 
               ha='center', va='center', fontsize=11, fontweight='bold',
               bbox=dict(boxstyle="round,pad=0.2", facecolor='white', alpha=0.8))
    
    left += width

# Styling
ax.set_xlim(0, 100)
ax.set_ylim(-0.5, 0.5)
ax.set_title(
    f'Distribution of Modification Types mapped to identified PTMs (n={mod_total:,})', 
    fontsize=16, fontweight='bold', pad=20
)
ax.set_yticks([])
ax.set_xticks([])

# Remove spines for cleaner look
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)

# Add grid for easier reading
ax.grid(axis='x', linestyle='--', alpha=0.3)

# Create legend with counts
legend_elements = [
    plt.Rectangle((0,0),1,1, facecolor=colors_mod[i], label=f'{mod_type}: {count:,} ({mod_percentages.iloc[i]:.1f}%)')
    for i, (mod_type, count) in enumerate(mod_res_data.items())
]

ax.legend(
    handles=legend_elements, 
    loc='center', bbox_to_anchor=(0.5, -0.25), ncol=4, 
    fontsize=10, frameon=False, columnspacing=1.5
)

plt.tight_layout()

# Save the figure
plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename='mod_res_modifications_stacked_bar',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

print(f"MOD_RES Analysis:")
print(f"Total modifications: {mod_total:,}")
for mod_type, count in mod_res_data.items():
    print(f"• {mod_type}: {count:,} ({count/mod_total*100:.1f}%)")
MOD_RES Analysis:
Total modifications: 5,656
• Phosphorylation: 4,737 (83.8%)
• Acetylation: 554 (9.8%)
• Methylation: 289 (5.1%)
• Acylation: 44 (0.8%)
• Hydroxylation: 18 (0.3%)
• Nitration/Nitrosylation: 6 (0.1%)
• ADP-ribosylation: 4 (0.1%)
• Other Modification: 4 (0.1%)

Above is the detailed distribution of specific MOD_RES modification types observed among the PTM-only peptides. Phosphorylation events (Phosphoserine, Phosphothreonine, Phosphotyrosine) dominate the landscape, highlighting their prevalence in regulatory processes. Other modifications, such as acetylation and methylation, occur at smaller but notable counts, indicating a diversity of PTM types contributing to peptide discordance.


04.3 Differential Proteoform (dPF) Analysis

Analyze peptides classified as differential proteoforms (dPF > 0), examining annotation coverage and categorizing reasons for proteoform identification based on manual review results.

# Smartly build dPF color palette for all data
dPF_colors = {
    # PTM-Only
    -1: "#a7c957",  # PTM Only
    # Canonical
    0: "#c3c3c3",  # Canonical
    # Proteoforms
    # Calculate reds (#bc4749) for dPF > 0
    **({} if (len(summary_data) == 0 or int(summary_data['dPF'].max()) <= 0) 
        else {i: sns.color_palette("Reds_r", n_colors=int(summary_data['dPF'].max())).as_hex()[i-1] for i in range(1, int(summary_data['dPF'].max()) + 1)})
}
print(f"Proteoform Group Colors: {dPF_colors}")
sns.palplot(list(dPF_colors.values()))
Proteoform Group Colors: {-1: '#a7c957', 0: '#c3c3c3', 1: '#bc141a', 2: '#f14432', 3: '#fc8a6a', 4: '#fdcab5'}
protein_with_dpf = protein_dpf_analysis.loc[
    (protein_dpf_analysis['n_forms'] > 0),
    'Protein'
].copy()

dPF_uniprot_maps = peptide_uniprot_map[
    (peptide_uniprot_map['dPF'] > 0)
].copy()
print(f"Number of Proteins with dPFs: {len(protein_with_dpf):,}")
print(f"Number of generated mappings: {len(dPF_uniprot_maps):,}")
print(f"Number of dPF Annotations found: {len(dPF_uniprot_maps[dPF_uniprot_maps['feature'] != 'No Annotation']):,}")
# dPF_uniprot_maps.head(20)

# protein_dpf_analysis.sort_values('dPF_pct', ascending=False).head(30)
Number of Proteins with dPFs: 1,190
Number of generated mappings: 54,051
Number of dPF Annotations found: 50,937
# Manual Check Results for dPFs
# Reasons Categorized
high_level_dict = {
    'No Annotation': 192,
    'Mixed/Unclear Reasons': 857,
    'Clear Reasons': 405,
}
# Categories of Clear Reasons
clear_reasons_dict = {
    'Missed Cleavages': 193,
    'Isoforms': 51,
    'Multiple PTMs': 37,
    'N-terminal': 34,
    'C-terminal': 31,
    'Variants': 20,
    'Mutagen': 20,
    'Cleavage': 19,
}

# Number of dPFs 
dPF_counts = protein_dpf_analysis['n_forms'].value_counts().sort_index()
print("dPF Counts per Protein:")
for n_forms, count in dPF_counts.items():
    print(f"  {n_forms}: {count:,}")

print("Number of Unique dPFs > 0:", protein_dpf_analysis[protein_dpf_analysis['n_forms'] > 0]['n_forms'].sum())
print()

print("Manual Check Results for dPFs:")
for reason, count in high_level_dict.items():
    print(f"  {reason}: {count:,}")

# Donut chart for dPF annotation reasons
labels = list(high_level_dict.keys())
counts = list(high_level_dict.values())
colors = ["#e5e5e5",  '#fdcab5', "#bc4749"]
fig, ax = plt.subplots(figsize=(8, 6))
wedges, texts = ax.pie(
    counts,
    labels=None,             
    startangle=140,
    colors=colors,
    wedgeprops=dict(width=0.60, edgecolor='w'),
    explode=[0.02 if c < 0.05 * sum(counts) else 0.0 for c in counts],
    counterclock=False
)
# Add percent labels placed on wedges
bbox_props = dict(boxstyle="round,pad=0.3", fc="white", ec="none", alpha=0.8)
for w, p in zip(wedges, np.array(counts) / sum(counts)):
    angle = (w.theta2 + w.theta1) / 2.0
    x, y = w.r * 0.75 * np.cos(np.deg2rad(angle)), w.r * 0.75 * np.sin(np.deg2rad(angle))
    ax.text(x, y, f"{p:.1%}", ha='center', va='center', fontsize=15, bbox=bbox_props, fontweight='bold')
# Center annotation: total proteins
ax.text(0, 0, f"Total\n{sum(counts)}", ha='center', va='center', fontsize=15, fontweight='bold')
# Legend with counts and percentages
legend_labels = [f"{lab}: {cnt:,} ({pct:.1%})" for lab, cnt, pct in zip(labels, counts, np.array(counts) / sum(counts))]
ax.legend(
    wedges, legend_labels,
    title="dPF Annotation Reasons",
    loc="upper center",
    bbox_to_anchor=(0.5, -0.12),
    ncol=len(labels),
    fontsize=11,
    title_fontsize=12,
    frameon=False, 
    handletextpad=0.5,  
    columnspacing=1.5,
)
ax.axis('equal')
plt.title('Distribution of dPF Annotation Reasons', fontsize=16, fontweight='bold')
plt.tight_layout()
plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename='dPF_annotation_reasons_Donut',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)

# Create a horizontal bar chart for Clear Reasons
clear_reasons = list(clear_reasons_dict.keys())
clear_counts = list(clear_reasons_dict.values())
clear_color = "#bc4749"
fig, ax = plt.subplots(figsize=(5, 5))
bars = ax.barh(clear_reasons, clear_counts, color=clear_color, edgecolor='black')
# Add counts next to the bars
for i, v in enumerate(clear_counts):
    ax.text(
        # v, i, str(v),
        # Adjust the position of the text
        v + 2.5, i-0.1, str(v),
        color='black', 
        ha='left', va='center', 
        rotation=0, fontsize=12, fontweight='bold'
    )
ax.set_xlabel("Number of dPFs", fontsize=12)
ax.set_ylabel("Clear Reason Categories", fontsize=12)
ax.set_title("Breakdown of Clear Reasons for dPF Annotations", fontsize=14, fontweight='bold')
ax.grid(True, axis="both", linestyle="--", linewidth=0.75, alpha=0.5, color="lightgrey")
plt.tight_layout()
plots.finalize_plot( 
    fig, show=True, save=save_to_folder, 
    filename='dPF_annotation_clear_reasons_Bar',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)
dPF Counts per Protein:
  0: 5,971
  1: 951
  2: 216
  3: 21
  4: 2
Number of Unique dPFs > 0: 1454

Manual Check Results for dPFs:
  No Annotation: 192
  Mixed/Unclear Reasons: 857
  Clear Reasons: 405

Due to the complexity of dPF peptides and the variety of potential reasons for their discordant behavior, a manual review was conducted to categorize the underlying causes. I was able to categorize them into three groups: No Annotation Found, Clear Reasons (annotations that clearly explain the dPF behavior), and Mixed/Unclear Reasons (annotations present but not obviously explaining the discordance or many different annotations making it hard to pinpoint a single cause).

While the automated annotation mapping provided a wealth of information, the diversity of annotations and their interpretations necessitated manual curation to accurately classify the dPF peptides. This manual review helped to clarify the biological relevance of the observed proteoform discordance.

About 28% (405) have clear reasons for their discordant behavior based on the annotations present. These include well-characterized PTMs, alternative splicing events, or known functional domains that could explain the differential regulation observed. However, the majority (59%) lack a well-separated reason: annotations are present, but they either do not clearly explain the discordance or are so numerous and varied that a single cause cannot be pinpointed.

Upon examining the annotations associated with dPF peptides, several common themes emerged among those with clear reasons for their discordant behavior. dPF peptides were frequently linked to:

  • Missed cleavages, which can result from incomplete enzymatic digestion during sample preparation, producing unexpected peptide forms
  • Alternative splicing events (mostly isoforms), yielding proteoforms with distinct sequences and functional properties
  • Post-translational modifications (PTMs) such as phosphorylation, acetylation, or ubiquitination, which can alter peptide behavior and detection
  • Functional domains or motifs known to influence protein activity, localization, or interactions, which may contribute to differential regulation
  • Cleavage and neo-N/C-terminal events

Note: The manual categorization process is inherently subjective and may vary between reviewers. Future work could focus on developing more objective criteria or automated methods for classifying dPF peptides based on annotation data.

Future Directions: With a web-based interface and interactive analysis and visualization tools, this process can be streamlined and left to the biologist, or experts to examine based on their domain knowledge rather than relying on a single reviewer. Additionally having access to more centralized or comprehensive annotations or even proteoforms themselves could make this step more automated and less subjective.


04.4 Consecutive Peptide Validation

Identify proteins where all positive dPF peptides are consecutive along the protein sequence. Consecutive dPF peptides suggest true proteoform regions rather than scattered technical artifacts.

tmp = summary_data.set_index('Protein').loc[protein_with_dpf].copy()

verbose = False  # toggle detailed per-protein reporting below
candidate_proteins = []

# Build a vectorized table for positive dPF groups across all proteins
df_pos = tmp.reset_index()[['Protein', 'dPF', 'PeptideID']].copy()
# Keep only positive dPFs
df_pos = df_pos[df_pos['dPF'] > 0].copy()

# Ensure PeptideID numeric and drop invalid rows
df_pos['PeptideID'] = pd.to_numeric(df_pos['PeptideID'], errors='coerce')
df_pos = df_pos.dropna(subset=['PeptideID'])
if df_pos.empty:
    if verbose:
        print("No positive dPFs found in the selected proteins.")
else:
    # Aggregate per (Protein, dPF)
    agg = (
        df_pos
        .groupby(['Protein', 'dPF'], observed=True)
        .PeptideID
        .agg(min_id='min', max_id='max', n_unique=lambda x: x.nunique())
        .reset_index()
    )
    # Compute consecutiveness: n_unique == (max_id - min_id + 1)
    agg['consecutive'] = agg['n_unique'] == (agg['max_id'] - agg['min_id'] + 1)

    # For each protein, require ALL positive-dPF groups to be consecutive
    per_protein_ok = agg.groupby('Protein', observed=True)['consecutive'].all()

    # Candidate proteins are those with True
    candidate_proteins = per_protein_ok[per_protein_ok].index.tolist()

    # # Optional verbose reporting (summarize kept / rejected)
    # if verbose:
    #     kept = per_protein_ok[per_protein_ok].index.tolist()
    #     rejected = per_protein_ok[~per_protein_ok].index.tolist()
    #     print(f"KEPT count: {len(kept):,}; REJECTED count: {len(rejected):,}")

    #     # Print examples (up to 10) with details to aid inspection
    #     for p in kept[:10]:
    #         rows = agg[agg['Protein'] == p].set_index('dPF')[['min_id','max_id','n_unique','consecutive']]
    #         print(f"KEPT: {p} -> all dPFs consecutive: {rows.to_dict(orient='index')}")

    #     for p in rejected[:10]:
    #         rows = agg[agg['Protein'] == p].set_index('dPF')[['min_id','max_id','n_unique','consecutive']]
    #         failed = rows[~rows['consecutive']]
    #         print(f"REJECT: {p} -> non-consecutive dPFs: {failed.index.tolist()}; details:")
    #         print(failed)

print(f"Number of Candidate Proteins with consistent (non-gap) dPFs: {len(candidate_proteins):,}")
print(f"Candidate Proteins: {candidate_proteins}")
Number of Candidate Proteins with consistent (non-gap) dPFs: 84
Candidate Proteins: ['A6NHQ2', 'O00299', 'O00391', 'O00425', 'O14772', 'O14939', 'O43242', 'O43663', 'O60266', 'O60294', 'O60343', 'O60502', 'O60524', 'O75044', 'O75496', 'O95817', 'P06702', 'P07951', 'P11274', 'P23469', 'P25054', 'P27540', 'P31146', 'P40227', 'P51116', 'P52306', 'P52735', 'P53804', 'P54198', 'P60228', 'P60983', 'P61106', 'P62701', 'P78386', 'Q01518', 'Q07890', 'Q12851', 'Q13131', 'Q13228', 'Q13618', 'Q14114', 'Q14376', 'Q14586', 'Q15125', 'Q15393', 'Q2M2I8', 'Q3SY52', 'Q5SSJ5', 'Q6UWP8', 'Q6ZRQ5', 'Q7L1Q6', 'Q7L576', 'Q86YC2', 'Q8IY33', 'Q8N1F7', 'Q8NBU5', 'Q8NHQ8', 'Q8TB69', 'Q8WUF5', 'Q96BT7', 'Q96JC4', 'Q96S21', 'Q99459', 'Q9BQ39', 'Q9BQG2', 'Q9BTE7', 'Q9BW04', 'Q9BZL6', 'Q9H0U4', 'Q9H223', 'Q9H9T3', 'Q9HCP0', 'Q9NPQ8', 'Q9NYK5', 'Q9NZI7', 'Q9NZI8', 'Q9NZJ0', 'Q9NZN5', 'Q9UG01', 'Q9UJW7', 'Q9UPZ3', 'Q9UQN3', 'Q9Y6I4', 'Q9Y6Y8']

This is one category that looked promising and interesting to look at. The idea here is that if all the dPF peptides for a given protein are consecutive along the protein sequence, it strongly suggests that these peptides represent a true proteoform region rather than scattered technical artifacts. Consecutive peptides indicate a coherent segment of the protein that may be differentially regulated or modified, supporting the biological relevance of the observed proteoform discordance.

Note: I used these proteins as the starting point for the manual checks, together with the biologically relevant proteins known from the hypoxia experiment.
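The consecutiveness rule applied above reduces to a simple arithmetic check per (Protein, dPF) group: a set of unique PeptideIDs is gap-free exactly when its count equals max − min + 1. A minimal sketch on made-up IDs:

```python
import pandas as pd

# Toy positive-dPF peptides: protein A is gap-free, protein B has a gap
toy = pd.DataFrame({
    'Protein':   ['A', 'A', 'A', 'B', 'B'],
    'dPF':       [1, 1, 1, 1, 1],
    'PeptideID': [4, 5, 6, 2, 7],  # A: 4-6 consecutive; B: 2 and 7 leave a gap
})

agg = (toy.groupby(['Protein', 'dPF'])['PeptideID']
          .agg(min_id='min', max_id='max', n_unique='nunique')
          .reset_index())
agg['consecutive'] = agg['n_unique'] == (agg['max_id'] - agg['min_id'] + 1)
print(agg[['Protein', 'consecutive']])  # A -> True, B -> False
```

Protein A passes (3 unique IDs spanning 4..6), while protein B fails (2 unique IDs spanning 2..7), mirroring how candidate proteins are kept only when all their positive-dPF groups are consecutive.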


05. Case Studies and Biological Validation

Based on the hypoxia experiment detailed in the paper, the following genes and proteins are of particular interest. The experiment exposed H358 lung cancer cells to hypoxic conditions (1% oxygen) and analyzed changes in their proteome. Key enzymes related to glycolysis and the glyoxalase system showed altered abundance, with an inverse correlation between hypoxic markers (LDHA, HK2, GAPDH upregulation) and GLO2 downregulation.

Key Proteins of Interest:

  • LDHA (P00338): Lactate production, upregulated in hypoxia
  • HK2 (P52789): Glycolysis entry, upregulated in hypoxia
  • GAPDH (P04406): Glycolysis, upregulated in hypoxia
  • HAGH/GLO2 (Q16775): Methylglyoxal metabolism, downregulated in hypoxia
  • GLO1 (Q04760): Methylglyoxal metabolism, unaffected
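These accessions can be pulled out of the per-protein results with a simple `isin` filter. A hedged sketch on a toy stand-in for `protein_dpf_analysis` (real column names, fabricated values):

```python
import pandas as pd

# Toy stand-in for protein_dpf_analysis: one row per protein
toy_analysis = pd.DataFrame({
    'Protein': ['P00338', 'P04406', 'Q16775', 'Q99999'],
    'Gene':    ['LDHA', 'GAPDH', 'HAGH', 'OTHER'],
    'n_forms': [0, 1, 0, 2],
})

proteins_of_interest = ['P00338', 'P52789', 'P04406', 'Q16775', 'Q04760']
hits = toy_analysis[toy_analysis['Protein'].isin(proteins_of_interest)]
print(hits['Gene'].tolist())  # ['LDHA', 'GAPDH', 'HAGH']
```

Accessions absent from the results (here Q04760 and P52789, by construction of the toy table) are silently skipped, which is convenient when screening a candidate list against the detected proteome.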
glycolytic_up_by_HIF_1 = [
    'P52789', #HK2                  # PTM
    'P08237', #PFKM (Muscle Type)   # No Signf.
    'P17858', #PFKL (Liver Type)    # PTM
    'Q01813', #PFKP (Platelet Type) # PTM
    'P14618', #PKM                  # dPF
    'P00338', #LDHA                 # PTM
    'P04406', #GAPDH                # dPF * Picked
    'Q04760', #GLO1                 # dPF * Picked 
    'Q16775', #HAGH (GLO2)          # PTM
]

# utils.view_table(
    
#     # protein_dpf_analysis[protein_dpf_analysis['Protein'].isin(candidate_proteins)].sort_values('Coverage', ascending=False), 
    
#     # Forms with Canon Peptides
#     # protein_dpf_analysis[
#     #     (protein_dpf_analysis['n_forms'] > 0) & 
#     #     (protein_dpf_analysis['n_canon'] > 0) &
#     #     (protein_dpf_analysis['n_dPF'] > 2)
#     # ].sort_values('n_dPF', ascending=False),
    
#     # # # DPF without Gaps
#     # protein_dpf_analysis[
#     #     (protein_dpf_analysis['Protein'].isin(candidate_proteins)) 
#     #     # (protein_dpf_analysis['dPF_pct'] > 45)
#     # ].sort_values('n_dPF', ascending=False),

#     # # Only PTMs
#     # protein_dpf_analysis[
#     #     (protein_dpf_analysis['dpf_cat'] == 'PTM Only')
#     # ].sort_values('Length', ascending=True),
#     page_number=9,
#     page_size=10,
# )

05.1 Detailed Peptide Visualization

Generate detailed peptide-level visualization showing intensity patterns, significance, and cluster assignments for selected proteins of interest. This enables in-depth examination of proteoform behavior under hypoxic conditions.

is_demo = False # set to True to skip saving figures
current_protein = 'Q07890'
cur_gene = protein_dpf_analysis.loc[
    protein_dpf_analysis['Protein'] == current_protein, 'Gene'
].values[0]
print(f"Current Protein: {current_protein} ({cur_gene})")
plots.detailed_peptide_with_clusters(
    data=test_data,
    cur_protein=current_protein,
    pThr=pThr,
    pvalue_col='adj_pval',
    cluster_col='ClusterID',
    rawIntensity_col='log10Intensity',
    adjIntensity_col='AdjIntensity',
    condition_palette=condition_colors,
    
    save= not is_demo,
    show=True,
    filename=f'detailed_peptide_look_{cur_gene}',
    filepath=figure_path,
    fileformats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)
Current Protein: Q07890 (SOS2)

05.2 Protein-Level Overview with Annotations

Create comprehensive protein-level visualization integrating UniProt annotations with peptide coverage. Multiple annotation layers show sequence features, modifications, and functional domains alongside identified peptides colored by dPF category.

# Enhanced Multi-Layered Protein Visualization with UniProt Annotations
print("🎨 CREATING ENHANCED PROTEIN VISUALIZATION WITH UNIPROT ANNOTATIONS")

# Get UniProt annotations for current protein using expanded data
current_uniprot = uniprot_data[uniprot_data['Protein'] == current_protein].copy()

# Get detailed peptide data with intensity information
current_detailed = test_data[test_data['Protein'] == current_protein].copy()

# Get summary data for current protein (contains dPF and trace info)
current_summary = summary_data[summary_data['Protein'] == current_protein].copy()

# Feature category colors for UniProt annotations
feature_category_colors = {
    'Sequence Heterogeneity & Isoforms': '#e74c3c',
    'Co- & Post-Translational Modifications': '#3498db', 
    'Protein Processing & Maturation': '#2ecc71',
    'Structure, Topology & Sequence Characteristics': '#f39c12',
    'Functional Domains, Regions & Sites': '#9b59b6'
}

# Define FIXED ordering for categories and features for consistency across all proteins
CATEGORY_ORDER = [
    'Sequence Heterogeneity & Isoforms',
    'Co- & Post-Translational Modifications', 
    'Protein Processing & Maturation',
    'Structure, Topology & Sequence Characteristics',
    'Functional Domains, Regions & Sites'
]

# Get protein information from current_summary
cur_seqLength = current_summary['Length'].iloc[0]
cur_protein = current_summary['Protein'].iloc[0]
cur_description = current_summary['Description'].iloc[0]
cur_geneName = current_summary['Gene'].iloc[0]
coverage_pct = current_summary['Coverage'].iloc[0] if 'Coverage' in current_summary.columns else 0
total_peptides = current_summary['PeptideID'].nunique()

# Calculate dynamic figure width based on protein length
protein_length = cur_seqLength
if protein_length <= 300:
    fig_width = 10
elif protein_length <= 600:
    fig_width = 15
elif protein_length <= 1000:
    fig_width = 20
else:
    fig_width = 25

# Calculate figure dimensions based on feature categories and max traces
max_traces = current_summary['trace'].max()
relevant_categories = []
category_feature_counts = {}
if len(current_uniprot) > 0:
    # Use FIXED category order - only include categories that have data
    available_categories = current_uniprot['feature_category'].unique().tolist()
    relevant_categories = [cat for cat in CATEGORY_ORDER if cat in available_categories]
    
    # Calculate number of features per category for dynamic height adjustment
    for category in relevant_categories:
        all_features_in_category = []
        for feat_cat, features in annotate.feature_categories.items():
            if feat_cat == category:
                all_features_in_category = features
                break
        category_feature_counts[category] = len(all_features_in_category)
else:
    relevant_categories = []

n_feature_categories = len(relevant_categories)

# Calculate heights based on feature counts in each category
category_heights = []
for category in relevant_categories:
    feature_count = category_feature_counts.get(category, 1)
    # Base height + extra height based on number of features
    height = 1.0 + (feature_count * 0.15)  # Scale height with feature count
    category_heights.append(height)

# Create enhanced figure with dynamic width and heights
base_height = 1  # Reduced height for peptides & protein subplot
total_category_height = sum(category_heights)
fig_height = total_category_height + base_height + 5  # Categories + peptides + margins

# Create subplot with custom height ratios and dynamic width
height_ratios = category_heights + [base_height]  # Categories + peptides subplot
n_subplots = n_feature_categories + 1
fig, axes = plt.subplots(nrows=n_subplots, ncols=1, figsize=(10, fig_height),
                        gridspec_kw={'height_ratios': height_ratios})

# Ensure axes is always a list/array
if n_subplots == 1:
    axes = [axes]

protein_length = cur_seqLength

# Set consistent x-axis ticks for all plots
step = max(1, protein_length // 20)  # Approximately 20 tick marks
positions = range(0, protein_length + 1, step)

# =============================================================================
# FEATURE CATEGORY LAYERS: UniProt Annotations by Category
# =============================================================================
for cat_idx, category in enumerate(relevant_categories):
    ax_cat = axes[cat_idx]
    ax_cat.set_xlim(0, protein_length)
    
    # Get annotations for this category
    cat_annotations = current_uniprot[current_uniprot['feature_category'] == category].copy()
    
    # Get ALL possible features for this category for completeness (FIXED ORDER)
    all_features_in_category = []
    for feat_cat, features in annotate.feature_categories.items():
        if feat_cat == category:
            all_features_in_category = features
            break
    
    # Create y-position mapping for ALL features in category (FIXED ORDER - whether present or not)
    # Sort features alphabetically for consistent ordering across proteins
    all_features_in_category_sorted = sorted(all_features_in_category)
    y_positions = {feat: i for i, feat in enumerate(all_features_in_category_sorted)}
    y_max = len(all_features_in_category_sorted)
    ax_cat.set_ylim(-0.5, y_max + 0.5)
    
    category_color = feature_category_colors.get(category, '#7f8c8d')
    
    if len(cat_annotations) > 0:
        # Filter to only relevant features if available
        cat_annotations = cat_annotations[cat_annotations['feature'].isin(all_features_in_category_sorted)]
        
        # Plot each annotation
        for _, annotation in cat_annotations.iterrows():
            start, end = annotation['start'], annotation['end']
            feature_type = annotation['feature']
            cleaned_note = annotation['note'] if pd.notna(annotation['note']) else ''
            
            if feature_type in y_positions:  # Skip CHAIN as it covers the whole protein
                y_pos = y_positions[feature_type]
                
                # Handle single-position features (start == end)
                if start == end:
                    # Draw as a vertical line or small marker
                    ax_cat.axvline(x=start, ymin=(y_pos-0.3)/(y_max+1), ymax=(y_pos+0.3)/(y_max+1), 
                                  color=category_color, linewidth=4, alpha=0.8)
                    # Add a small circle marker
                    ax_cat.scatter(start, y_pos, s=80, c=category_color, alpha=0.8, 
                                  edgecolors='white', linewidth=1, zorder=5)
                else:
                    # Draw annotation bar for ranges
                    ax_cat.barh(y_pos, end - start, left=start, height=0.7, 
                               color=category_color, alpha=0.8, edgecolor='white', linewidth=1)
                
                # Add annotation label with cleaned note
                feature_width = end - start if end != start else protein_length * 0.02
                if feature_width > protein_length * 0.03:  # Only label if feature is large enough or single position
                    label_text = f"{feature_type}"
                    if cleaned_note and len(cleaned_note) > 0:
                        label_text += f"\n{cleaned_note[:20]}{'...' if len(cleaned_note) > 20 else ''}"
                    
                    x_pos = start if start == end else start + (end - start)/2
                    ax_cat.text(x_pos, y_pos, label_text, 
                              ha='center', va='center', fontsize=8, fontweight='bold',
                              bbox=dict(boxstyle='round,pad=0.2', facecolor='white', alpha=0.8))
    
    # Set y-axis labels for ALL features in category (FIXED ORDER for completeness)
    ax_cat.set_yticks(list(y_positions.values()))
    ax_cat.set_yticklabels(list(y_positions.keys()), fontsize=9, fontweight='bold')
    
    # Add category name in upper left area of subplot
    ax_cat.text(0.02, 0.95, category, transform=ax_cat.transAxes, 
                fontsize=12, fontweight='bold', color=category_color,
                bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8, edgecolor=category_color),
                verticalalignment='top', horizontalalignment='left')
    
    # Remove y-axis label since we're putting category name in upper left
    ax_cat.set_ylabel('')
    ax_cat.set_xlabel('')
    
    # Enhanced grid visibility (both x and y grids)
    ax_cat.grid(True, alpha=0.4, axis='both', linestyle='-', linewidth=0.5)
    ax_cat.set_axisbelow(True)  # Put grid behind data
    
    # Set consistent x-axis ticks
    ax_cat.set_xticks(positions)
    ax_cat.set_xticklabels([])  # No labels on intermediate plots

# =============================================================================
# BOTTOM LAYER: Combined Protein Rectangle + Peptides + Position Axis
# =============================================================================
ax_combined = axes[-1]
ax_combined.set_xlim(0, protein_length)

# Calculate y-limits with minimal gaps to show overlaps clearly
y_min = -1.8  # More space for larger protein rectangle
y_max = max_traces + 0.3  # Minimal top margin for tighter layout
ax_combined.set_ylim(y_min, y_max)

# Draw larger protein rectangle with comprehensive information
protein_height = 1.0  # Increased height for more text
ax_combined.barh(-0.9, protein_length, height=protein_height, color='#34495e', alpha=0.8, edgecolor='white', linewidth=2)

# Add comprehensive protein information inside the rectangle
protein_text = f'{cur_geneName} ({cur_protein}) - {cur_description[:40]}{"..." if len(cur_description) > 40 else ""}\n{protein_length} AA | Coverage: {coverage_pct:.1f}%'
ax_combined.text(protein_length/2, -0.9, protein_text, 
                ha='center', va='center', fontsize=10, fontweight='bold', color='white',
                bbox=dict(boxstyle='round,pad=0.3', facecolor='#34495e', alpha=0.9, edgecolor='white'))

# Plot peptides with minimal gaps to visualize overlaps
for _, peptide in current_summary.iterrows():
    start, end = peptide['peptide_start'], peptide['peptide_end']
    trace = peptide['trace']
    dpf = peptide['dPF']
    is_sig = peptide['isSignificant']
    peptide_id = peptide['PeptideID']
    
    # Get color based on dPF category using provided dPF_colors
    color = dPF_colors.get(dpf, '#7f8c8d')
    
    # Adjust height and alpha based on significance - use full height for better overlap visualization
    height = 0.85 if is_sig else 0.7  # Increased height to show overlaps better
    alpha = 0.9 if is_sig else 0.7
    
    # Position peptides with minimal gaps - use precise trace positions
    y_position = trace * 0.9  # Reduced spacing between traces to show overlaps
    
    # Draw peptide bar
    ax_combined.barh(y_position, end - start, left=start, height=height, 
                    color=color, alpha=alpha, edgecolor='white', linewidth=1)
    
    # Add peptide label with significance indicator
    label = f"{peptide_id}{'*' if is_sig else ''}"
    font_weight = 'bold' if is_sig else 'normal'
    
    ax_combined.text(start + (end - start)/2, y_position, label, 
                    ha='center', va='center', fontsize=9, fontweight=font_weight, color='black')

# Create legend for dPF categories using provided colors
legend_elements = [
    plt.Rectangle((0,0),1,1, facecolor=color, alpha=0.8, label=f'dPF={dpf}' if dpf >= 0 else 'PTM') 
        for dpf, color in dPF_colors.items() if dpf in current_summary['dPF'].values
]
ax_combined.legend(handles=legend_elements, loc='upper right', fontsize=9, title='dPF Categories')

# Simplified y-axis with minimal labels
y_tick_positions = [-0.9] + [i * 0.9 for i in range(max_traces + 1)]
ax_combined.set_yticks(y_tick_positions)
# Only label protein, minimal trace indicators
y_labels = ['Protein'] + ['' for _ in range(max_traces + 1)]  # Empty labels for cleaner look
ax_combined.set_yticklabels(y_labels, fontsize=9, fontweight='bold')

ax_combined.set_xlabel('Protein Position (AA)', fontsize=12, fontweight='bold')
# Enhanced grid visibility (both x and y grids)
ax_combined.grid(True, alpha=0.4, axis='both', linestyle='-', linewidth=0.5)
ax_combined.set_axisbelow(True)  # Put grid behind data
ax_combined.set_ylabel('Peptides & Protein', fontsize=12, fontweight='bold', rotation=90)

# Set consistent x-axis ticks with labels
ax_combined.set_xticks(positions)
ax_combined.set_xticklabels(positions, fontsize=10)


# Enhanced suptitle with detailed protein information
fig.suptitle(
    f'{cur_geneName} ({cur_protein}) - {cur_description}\n'
    f'Length: {cur_seqLength} AA | Coverage: {coverage_pct:.1f}% | Peptides: {total_peptides}',
    fontsize=16, fontweight='bold', y=0.98
)

# Use subplots_adjust with very tight spacing
plt.subplots_adjust(left=0.15, right=0.95, top=0.94, bottom=0.08, hspace=0.05)
plots.finalize_plot( 
    fig, show=True, save=not is_demo, 
    filename=f'protein_overview_{cur_geneName}',
    filepath=figure_path,
    formats=figure_formats,
    transparent=transparent_bg,
    dpi=figure_dpi
)
🎨 CREATING ENHANCED PROTEIN VISUALIZATION WITH UNIPROT ANNOTATIONS
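The legend in the cell above is built by filtering the `dPF_colors` mapping down to the categories actually present in `current_summary`. That filtering can be reproduced in a self-contained sketch — the color mapping here is invented for illustration, with a negative dPF key flagging PTM-driven peptides as in the legend comprehension above:

```python
# Invented stand-in for the notebook's dPF_colors; keys are dPF categories,
# where a negative key marks PTM-driven peptides (labeled 'PTM' in the legend)
dPF_colors = {-1: '#c3c3c3', 0: '#139593', 1: '#fca311', 2: '#e54f2a'}
present = {-1, 1}  # e.g. set(current_summary['dPF'].values)

# Keep only categories present in the data, mirroring the legend comprehension
labels = [
    f'dPF={dpf}' if dpf >= 0 else 'PTM'
    for dpf in dPF_colors
    if dpf in present
]
print(labels)  # → ['PTM', 'dPF=1']
```

Because dictionaries preserve insertion order, the legend entries come out in the same order as the color mapping is declared.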
peptide_uniprot_map[
    (peptide_uniprot_map['Protein']==current_protein) & # Current Protein
    (peptide_uniprot_map['PeptideID'].isin([1, 2])) # Peptides to inspect
]
| | Protein | PeptideID | pepStart | pepEnd | feature_category | feature | startpos | endpos | group | agent | note | score | dPF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 111777 | Q07890 | 1 | 85 | 96 | Co- & Post-Translational Modifications | CROSSLNK | 96 | 96 | Ubiquitination/SUMOylation | | Ubiquitination/SUMOylation at K96 | 1.0000 | 1 |
| 111778 | Q07890 | 2 | 243 | 256 | No Annotation | | -1 | -1 | | | | 0.0000 | 1 |
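The two-clause filter above (equality on `Protein`, `isin` on `PeptideID`) is a general pandas pattern. A toy reproduction with invented rows:

```python
import pandas as pd

# Toy stand-in for peptide_uniprot_map; values are illustrative only
toy = pd.DataFrame({
    'Protein':   ['Q07890', 'Q07890', 'P07951'],
    'PeptideID': [1, 2, 1],
    'dPF':       [1, 1, 0],
})

# Each clause must be parenthesized before combining with & (bitwise AND),
# since Python's `and` does not broadcast over Series
subset = toy[(toy['Protein'] == 'Q07890') & (toy['PeptideID'].isin([1, 2]))]
print(len(subset))  # → 2
```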
uniprot_data[
    (uniprot_data['Protein'] == current_protein) & 
    (uniprot_data['feature'].isin(['CLEAVAGE']))
]
| | isoform | Protein | feature | start | end | group | agent | note | description | score | is_relevant | feature_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 341919 | | Q07890 | CLEAVAGE | 526 | 526 | Proteolytic Cleavage | peptidyl-Lys metallopeptidase | N-Ph cleavage at A526 by peptidyl-Lys metallop... | MEROPS Cleavage Site; Protease: peptidyl-Lys m... | 2 | True | Protein Processing & Maturation |
| 341921 | | Q07890 | CLEAVAGE | 557 | 557 | Proteolytic Cleavage | peptidyl-Lys metallopeptidase | N-Ph cleavage at L557 by peptidyl-Lys metallop... | MEROPS Cleavage Site; Protease: peptidyl-Lys m... | 2 | True | Protein Processing & Maturation |
potential_figure_candidates = [
    'O14772', # FPGT, Nice looking dPFs with likely isoform
    'Q9BW04', # SARG, Biologically not interesting, but has two distinct dPFs
    'Q96JC4', # ZNF479, Downregulated in 72hr hypoxia, single dPFs with phospho in repeat peptides.
    'P07951', # TPM2, Only Normoxia 48hr upregulated,...
    'O60294', # LCMT2, small 2 peptide dPF, and PTM at 10th peptide
    'O43242', # PSMD3, small 2 peptide dPF (hypoxia-72hr down), and PTM (Normoxia-72hr down)
    'O60343', # TBCD4, Biologically relevant, (hypoxia-72hr down), with PTM (Normoxia-72hr down)
    'Q07890', # SOS2, Biologically relevant, Prior to cleavage site (hypoxia-72hr down)
]
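Each candidate accession can be rendered with the same `protein_overview_{geneName}` filename scheme passed to `plots.finalize_plot` above. A minimal sketch of that naming — the `gene_lookup` mapping here is hypothetical, falling back to the raw accession when no gene name is known:

```python
# Hypothetical accession-to-gene mapping; the notebook derives cur_geneName
# from its summary tables instead
gene_lookup = {'O14772': 'FPGT', 'Q07890': 'SOS2'}

def overview_filename(protein: str) -> str:
    # Fall back to the accession itself when the gene name is unknown
    return f"protein_overview_{gene_lookup.get(protein, protein)}"

names = [overview_filename(p) for p in ['O14772', 'Q07890', 'P00000']]
print(names)  # → ['protein_overview_FPGT', 'protein_overview_SOS2', 'protein_overview_P00000']
```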
print("Notebook Execution Time:", utils.prettyTimer(utils.getTime() - startTime))
Notebook Execution Time: 00h:00m:16s
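The `00h:00m:16s` string is produced by `utils.prettyTimer`. A minimal sketch of a formatter yielding the same shape, assuming the elapsed time is given in seconds (the real helper's signature may differ):

```python
# Minimal elapsed-time formatter matching the "00h:00m:16s" output style;
# the notebook's utils.prettyTimer is assumed to behave similarly
def pretty_timer(seconds: float) -> str:
    secs = int(seconds)
    hours, rem = divmod(secs, 3600)
    mins, secs = divmod(rem, 60)
    return f"{hours:02d}h:{mins:02d}m:{secs:02d}s"

print(pretty_timer(16))    # → 00h:00m:16s
print(pretty_timer(3725))  # → 01h:02m:05s
```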